PROBABILITY PROPORTIONAL TO SIZE WITHOUT REPLACEMENT -2148149
2148149--VARNIKA V.N
PROBABILITY PROPORTIONAL TO SIZE WITHOUT REPLACEMENT
What is PPS?
In PPS (Probability Proportional to size) Sampling, the probability of selection of element is proportional to its size. This improves the accuracy of population estimates because the larger elements have the greatest impact on the population estimates.
In order to ensure that all units (ex. individuals) in the population have the same probability of selection irrespective of the size of their cluster, each of the hierarchical levels prior to the ultimate level has to be sampled according to the size of ultimate units it contains, but the same number of units has to be sampled from each cluster at the last hierarchical level. This method also facilitates planning for fieldwork because a pre-determined number of individuals is interviewed in each unit selected, and staff can be allocated accordingly It is most useful when the sampling units vary considerably in size because it assures that those in larger sites have the same probability of getting into the sample as those in smaller sites, and vice verse.
The basic difference between simple random sampling and pps sampling procedure is in simple random sampling, the probability of drawing any specified unit at any given draw is the same while in PPS sampling it differs from draw to draw.
Let us take this example of Wikipedia which illustrates clearly the concept of PPS Sampling,
Suppose we have six schools with a population of 150,180,200,220,260 and 490 students respectively. Total 1500 students and we want to use student population as the basis for a PPS sample of size 3. To this we will allocate the first school numbers 1 to 150, the second school 151 to 330(=150+180), the third school 331to 530, and soon to the last school (10111to 1500). We then generate a random start between 1 and 500(equal to 1500/3)and count through the school populations by multiples of 500. If our random start was 137, we would select the schools which have been allocated numbers 137,637, and 1137 i.e.the first, fourth and sixth schools.
In PPS there are 2 ways in which we can draw samples they are,
- PPS with replacement
- PPS without replacement.
In this blog, we will look into the concept of probability proportional to size precisely without replacement technique and sampling without replacement is more efficient than with replacement technique.
PPSWOR is divided into ordered and unordered estimators for better clarity.
Ordered estimators:
- Des-Raj's ordered, estimator
Unordered estimators:
- Horvitz Thompson Estimator
- Murthy's unordered Estimator
Ordered Estimators:
APPLICATION OF DESRAJ ESTIMATOR:
The application of desraj estimator is demonstrated on the data-set which shows the area Under lime (in acres) and No. of bearing lime trees in each of the 22 villages growing lime in one of the tehsils of Bangalore district. The procedure includes drawing the sample of size 5 and with ppswor scheme and estimate the total number of bearing lime trees using ordered Desraj estimator, then also give the bound on the error of estimation.
#ANALYSIS:
#loading the package
library(fpest)
library(readxl)
tree_data <- read_excel("C:/Users/DELL/Downloads/tree_data.xlsx")
View(tree_data)
attach(tree_data)
#Now wew obtain sum of area under lime
S<-sum(`Area Under lime(in acres)`)
S
## [1] 497.66
#Now, we have to find the probability
of the values of the area under lime(in acres)
pi<-(`Area Under lime(in acres)`)/S
pi
## [1]
0.0658481694 0.0160149500 0.0012458305 0.0313667966 0.0861029619
## [6]
0.0804364426 0.0188683037 0.0127195274 0.0101474903 0.1899891492
## [11] 0.1079250894 0.0013463007 0.0016477113
0.0043202186 0.0008640437
## [16] 0.2478800788 0.0005827272 0.0060282120
0.0080376160 0.0040188080
## [21] 0.0124783989 0.0921311739
Now we
have to combine the data frame and their respective probabilities.
Data_1<-cbind(tree_data,pi)
Data_1
## S.No. of villages Area Under lime(in acres)
No. of bearing lime trees
## 1
1 32.77 2328
## 2
2 7.97 754
## 3
3 0.62 105
## 4
4 15.61 949
## 5
5 42.85 3091
## 6
6 40.03 1736
## 7 7 9.39 840
## 8
8 6.33 311
## 9
9 5.05 0
## 10
10 94.55 3044
## 11
11 53.71 2483
## 12
12 0.67 128
## 13
13 0.82 102
## 14
14 2.15 60
## 15
15 0.43 0
## 16
16 123.36 11799
## 17
17 0.29 26
## 18
18 3.00 317
## 19
19 4.00 190
## 20
20 2.00 180
## 21
21 6.21 752
## 22
22 45.85 3091
##
pi
## 1
0.0658481694
## 2
0.0160149500
## 3
0.0012458305
## 4
0.0313667966
## 5
0.0861029619
## 6
0.0804364426
## 7
0.0188683037
## 8
0.0127195274
## 9
0.0101474903
## 10 0.1899891492
## 11 0.1079250894
## 12 0.0013463007
## 13 0.0016477113
## 14 0.0043202186
## 15 0.0008640437
## 16 0.2478800788
## 17 0.0005827272
## 18 0.0060282120
## 19 0.0080376160
## 20 0.0040188080
## 21 0.0124783989
## 22 0.0921311739
#Now we have to obtain a sample using the methodof PPSWOR #FIRT,we
have to import the fpest library
set.seed(123)
PPS_WOR<-Data_1[sample(1:nrow(Data_1),5,replace =
F),]
PPS_WOR
## S.No. of villages Area Under lime(in acres)
No. of bearing lime trees
## 15
15 0.43 0
## 19
19 4.00 190
## 14
14 2.15 60
## 3
3 0.62 105
## 10
10 94.55 3044
##
pi
## 15 0.0008640437
## 19 0.0080376160
## 14 0.0043202186
## 3
0.0012458305
## 10 0.1899891492
desraj(PPS_WOR$`No. of
bearing lime trees`,PPS_WOR$pi)
##
$est
## [1] 27426.98
##
## $estvar
## [1] 210519342
##
## $tvals
## [1] 0.00
23618.43 13954.56 83416.77 16145.17
#Hence,the estimate of the total number of bearing lime trees is
27426.98.
#Now we have to find the bound on Error of Estimation.
variance<-210519342
Standard_Error<-sqrt(variance)
Standard_Error
##
[1] 14509.28
Bound_Error<-2*Standard_Error
Bound_Error
##
[1] 29018.57
CONCLUSION:
The analysis shows that the estimation of the total no. of bearing lime trees in the population is 27427 trees with the variance on the estimated value as 210519342. The bound-on error estimate is 29018.57. So, we can say that almost about 27427 total no. of trees are in 22 villages growing lime in one of the Bangalore district's tehsils.
UNORDERED ESTIMATORS:
In the ordered estimator, the order in which the units are drawn is considered. Corresponding to any ordered estimator, there exists an unordered estimator which does not depend on the order in which the units are drawn and has a smaller variance than the ordered estimator.
In case of sampling WOR from a population of size N, there are Ncn unordered sample(s) of size n.
Corresponding to an unordered sample(s) of size n units, there are n! ordered samples. For example, for n =2 if the units are u1 and u2 , then
- there are 2! ordered samples - (u1,u2) and (u2,u1).
- there is one unordered sample (u1,u2)
- Horvitz-Thompson Estimator:
library(readxl)
tree_data <- read_excel("C:/Users/DELL/Downloads/tree_data.xlsx")
View(tree_data)
data<-tree_data
attach(data)
head(data)
## # A tibble: 6 x 3
## `S.No. of
villages` `Area Under lime(in acres)` `No. of bearing lime trees`
##
<dbl>
<dbl>
<dbl>
## 1
1 32.8 2328
## 2 2 7.97 754
## 3
3 0.62 105
## 4
4 15.6 949
## 5
5 42.8 3091
## 6
6 40.0 1736
library(samplingbook)
## Loading required package: pps
## Loading required package: sampling
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survival'
## The following objects are masked from
'package:sampling':
##
## cluster,
strata
##
## Attaching package: 'survey'
## The following object is masked from
'package:graphics':
##
## dotchart
#Horvitz-Thompson
Method
set.seed(123)
pps=pps.sampling(data$`No. of
bearing lime trees`,n=8,method="midzuno")
pps
##
## pps.sampling object: Sample with probabilities
proportional to size
## Method of Midzuno:
##
## PPS sample:
## [1] 1 5 6 10
11 16 20 22
##
## Sample probabilities:
##
[,1] [,2] [,3]
[,4] [,5] [,6]
## [1,] 0.81249455 0.81249455 0.4268213 0.81249455
0.67908559 0.81249455
## [2,] 0.81249455 1.00000000 0.6058808 1.00000000
0.86659105 1.00000000
## [3,] 0.42682130 0.60588081 0.6058808 0.60588081
0.47848110 0.60588081
## [4,] 0.81249455 1.00000000 0.6058808 1.00000000
0.86659105 1.00000000
## [5,] 0.67908559 0.86659105 0.4784811 0.86659105
0.86659105 0.86659105
## [6,] 0.81249455 1.00000000 0.6058808 1.00000000
0.86659105 1.00000000
## [7,] 0.04959955 0.06282174 0.0350299 0.06282174
0.05341423 0.06282174
## [8,] 0.81249455 1.00000000 0.6058808 1.00000000
0.86659105 1.00000000
##
[,7] [,8]
## [1,] 0.04959955 0.81249455
## [2,] 0.06282174 1.00000000
## [3,] 0.03502990 0.60588081
## [4,] 0.06282174 1.00000000
## [5,] 0.05341423 0.86659105
## [6,] 0.06282174 1.00000000
## [7,] 0.06282174 0.06282174
## [8,] 0.06282174 1.00000000
sample=data[pps$sample, ]
sample #selected sample using midzuno method
## # A tibble: 8 x 3
## `S.No. of
villages` `Area Under lime(in acres)` `No. of bearing lime trees`
## <dbl> <dbl> <dbl>
## 1
1 32.8 2328
## 2
5 42.8 3091
## 3
6 40.0 1736
## 4
10
94.6 3044
## 5
11
53.7 2483
## 6
16 123. 11799
## 7
20 2 180
## 8
22
45.8 3091
N=nrow(data)
N
## [1] 22
popmean=mean(data$`No. of bearing
lime trees`)
PI=pps$PI
head(PI)
##
[,1] [,2] [,3]
[,4] [,5] [,6]
[,7]
## [1,] 0.8124945 0.8124945 0.4268213 0.8124945
0.6790856 0.8124945 0.04959955
## [2,] 0.8124945 1.0000000 0.6058808 1.0000000 0.8665910
1.0000000 0.06282174
## [3,] 0.4268213 0.6058808 0.6058808 0.6058808
0.4784811 0.6058808 0.03502990
## [4,] 0.8124945 1.0000000 0.6058808 1.0000000
0.8665910 1.0000000 0.06282174
## [5,] 0.6790856 0.8665910 0.4784811 0.8665910
0.8665910 0.8665910 0.05341423
## [6,] 0.8124945 1.0000000 0.6058808 1.0000000
0.8665910 1.0000000 0.06282174
##
[,8]
## [1,] 0.8124945
## [2,] 1.0000000
## [3,] 0.6058808
## [4,] 1.0000000
## [5,] 0.8665910
## [6,] 1.0000000
PI=as.matrix(PI)
#Estimation of variance using HT
Method.
var_ht=htestimate(sample$`No. of
bearing lime trees`,N=N,PI=PI,method='ht')
var_ht
##
## htestimate object: Estimator for samples with
probabilities proportional to size
## Method of Horvitz-Thompson:
##
## Mean estimator: 1476.636
## Standard Error: 117.0282
#pik is the 1st
order inclusion prob for all population units
po_tota=popmean*N
po_tota
## [1] 32486
INTERPRETATION:
Using the Horvitz-Thompson method, the estimated average number of bearing lime trees is 1476.636 with a standard error of 117.0282.
Ex2:
In this example below we calculated the Horvitz-Thompson estimate with different methods for variance estimation such as Yates and Grundy, Hansen-Hurwitz, and Hajek for dataset influenza.
input:
data(influenza)
summary(influenza)
# pps.sampling()
set.seed(108506)
pps <- pps.sampling(z=influenza$population,n=20,method='midzuno')
sample <- influenza[pps$sample,]
# htestimate()
N <- nrow(influenza)
# exact variance estimate
PI <- pps$PI
htestimate(sample$cases, N=N, PI=PI, method='yg')
htestimate(sample$cases, N=N, PI=PI, method='ht')
# approximate variance estimate
pk <- pps$pik[pps$sample]
htestimate(sample$cases, N=N, pk=pk, method='hh')
pik <- pps$pik
htestimate(sample$cases, N=N, pk=pk, pik=pik, method='ha')
# without pik just approximate calculation of Hajek method
htestimate(sample$cases, N=N, pk=pk, method='ha')
# calculate confidence interval based on normal distribution for number of cases
est.ht <- htestimate(sample$cases, N=N, PI=PI, method='ht')
est.ht$mean*N
lower <- est.ht$mean*N - qnorm(0.975)*N*est.ht$se
upper <- est.ht$mean*N + qnorm(0.975)*N*est.ht$se
c(lower,upper)
# true number of influenza cases
sum(influenza$cases)
output:Loading required package: pps
Loading required package: sampling
Loading required package: survey
Loading required package: grid
Loading required package: Matrix
Loading required package: survival
Attaching package: 'survival'
The following objects are masked from 'package:sampling':
cluster, strata
Attaching package: 'survey'
The following object is masked from 'package:graphics':
dotchart
id district population cases
Min. : 1001 LK Aachen : 1 Min. : 34719 Min. : 0.00
1st Qu.: 5877 LK Ahrweiler : 1 1st Qu.: 104553 1st Qu.: 9.00
Median : 8331 LK Aichach-Friedberg: 1 Median : 145130 Median : 27.00
Mean : 8468 LK Alb-Donau-Kreis : 1 Mean : 193910 Mean : 44.58
3rd Qu.: 9778 LK Altenburger Land : 1 3rd Qu.: 244154 3rd Qu.: 59.00
Max. :16077 LK Altenkirchen : 1 Max. :1770629 Max. :410.00
(Other) :418
htestimate object: Estimator for samples with probabilities proportional to size
Method of Yates and Grundy:
Mean estimator: 40.36766
Standard Error: 8.059507
htestimate object: Estimator for samples with probabilities proportional to size
Method of Horvitz-Thompson:
Mean estimator: 40.36766
Standard Error: 8.227719
htestimate object: Estimator for samples with probabilities proportional to size
Method of Hansen-Hurwitz (approximate variance):
Mean estimator: 40.36766
Standard Error: 8.534792
htestimate object: Estimator for samples with probabilities proportional to size
Method of Hajek (approximate variance):
Mean estimator: 40.36766
Standard Error: 8.262482
htestimate object: Estimator for samples with probabilities proportional to size
Method of Hajek (approximate variance):
Mean estimator: 40.36766
Standard Error: 8.244296
Conclusion:we can say the least biased and mostaccurate method to be used is Hajek method followed by Yates-Grundy,Horvitz-Thompson and Hansen Hurwitz> in that order.Murthy’s unordered estimator corresponding to Des Raj’s ordered estimator for the
sample size 2Suppose yi and yj are the values of units Ui and Uj selected in the first and second draws respectively with varying probability and WOR in a sample of size 2 and let pi and pj be the corresponding initial probabilities of selection. So now we have two ordered estimates corresponding to the ordered samples s1* and s2* as follows
s1*=(yi,yj) and s2*=(Ui,Uj)
Thus we can conclude that Unordered sampling is more efficient than ordered sampling.



Comments
Post a Comment