PROBABILITY PROPORTIONAL TO SIZE SAMPLING
KRUPA JINSA JIGY
2148152
DEPARTMENT OF STATISTICS
CHRIST DEEMED TO BE UNIVERSITY
Introduction
Simple random sampling (SRS) is a useful way of selecting a sample from a population when the units do not vary much in size. For example, if the cultivators' fields in a village are of similar size, the use of SRS is advantageous. In real-life situations, however, the units often vary considerably in size.
All the villages in a district may not be of the same size. Some may be very small, some large and some very large. In such a situation SRS cannot distinguish between them, and all units have the same probability of selection. A better approach is to assign selection probabilities proportional to the sizes of the units, since the larger units are expected to make a greater contribution to the population total.
One way of increasing the efficiency of the estimates is to assign unequal probabilities of selection to the different units in the population.
More specifically, the selection probabilities can be made directly proportional to a size measure such as the total area or total crop area of the units.
A procedure of sampling in which units are selected with probabilities proportional to their size is called probability proportional to size (pps) sampling. When each selected unit is returned to the population before the next draw, the mechanism is called sampling with probability proportional to size with replacement (ppswr). It is simpler to handle than pps sampling without replacement.
Selecting a sample with PPS with replacement
Let there be N units in the population and let $x_1, x_2, \ldots, x_N$ be the corresponding sizes of the units. The probabilities assigned to the N units are proportional to these sizes. The natural numbers 1 to $x_1$ are associated with the first unit, $x_1 + 1$ to $x_1 + x_2$ with the second unit, and so on. We draw a number R at random from 1 to $S_N = \sum_{i=1}^{N} x_i$ and select the i-th unit in the population for which $x_1 + x_2 + \cdots + x_{i-1} < R \le x_1 + x_2 + \cdots + x_{i-1} + x_i$, where the sum on the left is interpreted as zero when i = 1. It is evident that this procedure gives the i-th unit in the population a probability of selection proportional to $x_i$. The procedure is repeated n times if a sample of size n is required.
An example of the selection of units by probability proportional to size with replacement is given below for illustration.
A village has 10 orchards containing 150, 50, 80, 100, 200, 160, 40, 220, 60 and 140 trees respectively. Select a sample of 4 orchards with replacement and with probability proportional to the number of trees in the orchard.
The total number of trees in all the 10 orchards in the village is 1200. The first step in the selection of orchards is to form successive cumulative totals as shown below.
Orchard   Number of trees   Cumulative total   Associated numbers
1         150               150                1-150
2         50                200                151-200
3         80                280                201-280
4         100               380                281-380
5         200               580                381-580
6         160               740                581-740
7         40                780                741-780
8         220               1000               781-1000
9         60                1060               1001-1060
10        140               1200               1061-1200
A draw is made from the table of random numbers in such a way that the selected random number does not exceed 1200. Let the selected random number be 600. It can be seen from the successive cumulative totals that this is one of the numbers from 581 to 740 associated with the 6th orchard. The 6th orchard is therefore selected for the random number 600. Next, another random number is drawn in the same way, it is matched with the class of successive cumulative totals, and the corresponding orchard is selected. Two more random numbers are then drawn and the procedure is repeated. Let these three further random numbers be 650, 850 and 300. The orchards selected for these random numbers are the 6th, 8th and 4th respectively. Note that the 6th orchard is selected twice, which is possible because the sampling is with replacement.
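The cumulative total method for the orchard example can be sketched in R as follows. This is a minimal illustration, not part of the original analysis: the helper name `select_unit` is an assumption introduced here, and the four random numbers are the ones used in the example rather than fresh random draws.

```r
# Sizes (number of trees) of the 10 orchards from the example
trees <- c(150, 50, 80, 100, 200, 160, 40, 220, 60, 140)

# Successive cumulative totals: 150, 200, 280, ..., 1200
cum_totals <- cumsum(trees)

# Hypothetical helper: map a number R (1 <= R <= total) to the unit i
# whose range (cum_totals[i - 1], cum_totals[i]] contains R
select_unit <- function(R, cum_totals) {
  findInterval(R, cum_totals, left.open = TRUE) + 1
}

# The four random numbers used in the example
draws <- c(600, 650, 850, 300)
selected <- select_unit(draws, cum_totals)
selected  # orchards 6, 6, 8 and 4

# In practice the numbers would be drawn at random, for example:
# draws <- sample.int(sum(trees), size = 4, replace = TRUE)
```

Using `findInterval()` on the cumulative totals avoids writing out the ranges by hand, although the totals themselves still have to be formed.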
The main drawback of this procedure is that one has to write down the successive cumulative totals. When the number of units in the population is large, the procedure becomes time consuming and tedious. Lahiri has suggested an alternative procedure which does not require writing down cumulative totals.
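Lahiri's method is usually described as follows: draw a unit number i at random from 1 to N and a second number j at random from 1 to M, where M is the largest size (or any upper bound on the sizes); accept unit i if j does not exceed its size, otherwise reject the pair and draw again. A minimal R sketch under that description is given below. The function name `lahiri_ppswr` is an assumption introduced here, and the sizes are assumed to be positive integers.

```r
# Sketch of Lahiri's method (assumes integer sizes; function name is hypothetical)
# sizes: vector of unit sizes; n: number of draws, made with replacement
lahiri_ppswr <- function(sizes, n) {
  N <- length(sizes)
  M <- max(sizes)                # any upper bound on the sizes would also work
  chosen <- integer(n)
  for (k in seq_len(n)) {
    repeat {
      i <- sample.int(N, 1)      # candidate unit, uniform on 1..N
      j <- sample.int(M, 1)      # second number, uniform on 1..M
      if (j <= sizes[i]) break   # accept unit i, otherwise redraw both numbers
    }
    chosen[k] <- i
  }
  chosen
}

# Example with the orchard sizes
trees <- c(150, 50, 80, 100, 200, 160, 40, 220, 60, 140)
set.seed(1)
lahiri_sample <- lahiri_ppswr(trees, 4)
lahiri_sample
```

No cumulative totals are needed, at the cost of some rejected draws; the rejection rate is higher when a few units are much larger than the rest.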
Estimation in PPS Sampling with replacement: Total and its sampling variance
Consider a population of N units and let $y_i$ be the value of the characteristic under study for the unit $u_i$ of the population (i = 1, ..., N). Suppose further that $p_i = x_i / X$, where $X = \sum_{i=1}^{N} x_i$, is the probability that the unit $u_i$ is selected in a sample of one, so that $\sum_{i=1}^{N} p_i = 1$. Let n independent selections be made with replacement and let the value of $y_i$ be observed for each selected unit. Further, let $(y_i, p_i)$ be the value and the probability of selection of the i-th unit of the sample. It can be seen that the random variates $y_i / p_i$ (i = 1, 2, ..., n) are independently and identically distributed. If $p_i = 1/N$ for every unit, the procedure reduces to simple random sampling with replacement. This shows that simple random sampling is a particular case of pps sampling.
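Theorem:
In PPS sampling, wr, an unbiased estimator of the population total Y is given by
$$\hat{Y}_{PPS} = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{p_i}$$
with its sampling variance
$$V(\hat{Y}_{PPS}) = \frac{1}{n} \sum_{j=1}^{N} p_j \left( \frac{y_j}{p_j} - Y \right)^2 = \frac{1}{n} \left( \sum_{j=1}^{N} \frac{y_j^2}{p_j} - Y^2 \right)$$
Proof:
Let us define random variates $z_i = y_i / p_i$ (i = 1, ..., n), which are independently and identically distributed. Hence,
$$E(z_i) = \sum_{j=1}^{N} p_j \frac{y_j}{p_j} = Y, \qquad V(z_i) = \sum_{j=1}^{N} p_j \left( \frac{y_j}{p_j} - Y \right)^2$$
Now let us consider
$$E(\hat{Y}_{PPS}) = \frac{1}{n} \sum_{i=1}^{n} E(z_i)$$
Since $E(z_i) = Y$ for every i, we have $E(\hat{Y}_{PPS}) = Y$, so $\hat{Y}_{PPS}$ is an unbiased estimator. To obtain the sampling variance of $\hat{Y}_{PPS}$, we have
$$V(\hat{Y}_{PPS}) = \frac{1}{n^2} \sum_{i=1}^{n} V(z_i) = \frac{1}{n} \sum_{j=1}^{N} p_j \left( \frac{y_j}{p_j} - Y \right)^2$$
the covariance terms $Cov(y_i / p_i, y_j / p_j)$, $i \ne j$, being zero.
This shows that the variance of the estimator is inversely proportional to the sample size n, as in simple random sampling, wr.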
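The unbiasedness of the estimator $(1/n) \sum y_i / p_i$ and its variance $(1/n) \sum p_j (y_j / p_j - Y)^2$ can be checked exactly on a small population by listing every possible ordered sample. The sketch below does this in R for n = 2; the population values are hypothetical and chosen only for illustration.

```r
# Hypothetical toy population (illustrative values only)
y <- c(10, 30, 60)          # study variable
x <- c(1, 2, 5)             # size measure
p <- x / sum(x)             # selection probabilities p_j = x_j / X
Y <- sum(y)                 # true population total
n <- 2                      # sample size

# All N^n ordered samples of size 2 and their probabilities
idx <- expand.grid(first = seq_along(y), second = seq_along(y))
prob <- p[idx$first] * p[idx$second]

# PPS estimate of the total for each possible sample
est <- (y[idx$first] / p[idx$first] + y[idx$second] / p[idx$second]) / n

expected_value   <- sum(prob * est)              # expectation over all samples
exact_variance   <- sum(prob * (est - Y)^2)      # variance over all samples
formula_variance <- sum(p * (y / p - Y)^2) / n   # variance from the formula

all.equal(expected_value, Y)
all.equal(exact_variance, formula_variance)
```

Both comparisons hold exactly (up to floating-point error) because the enumeration covers the whole sampling distribution rather than a simulation of it.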
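Corollary:
An unbiased estimator of the population mean, $\bar{Y} = Y / N$, is given by
$$\hat{\bar{Y}}_{PPS} = \frac{\hat{Y}_{PPS}}{N} = \frac{1}{nN} \sum_{i=1}^{n} \frac{y_i}{p_i}$$
with its sampling variance
$$V(\hat{\bar{Y}}_{PPS}) = \frac{1}{nN^2} \sum_{j=1}^{N} p_j \left( \frac{y_j}{p_j} - Y \right)^2$$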
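Theorem:
In PPS sampling, wr, an unbiased estimator of $V(\hat{Y}_{PPS})$ is given by
$$v(\hat{Y}_{PPS}) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \left( \frac{y_i}{p_i} - \hat{Y}_{PPS} \right)^2$$
Proof:
In PPS sampling, wr, $z_i = y_i / p_i$ (i = 1, 2, ..., n) are n independent unbiased estimators of Y having the same variance $\sigma_z^2$, and $V(\hat{Y}_{PPS}) = \sigma_z^2 / n$.
It can be shown easily that the sample variance $s_z^2 = \frac{1}{n-1} \sum_{i=1}^{n} (z_i - \bar{z})^2$, with $\bar{z} = \hat{Y}_{PPS}$, is an unbiased estimator of $\sigma_z^2$. Hence $s_z^2 / n$ is an unbiased estimator of $V(\hat{Y}_{PPS})$, which proves the theorem.
Corollary: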
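An unbiased estimator of $V(\hat{Y}_{PPS})$ is given by
$$v(\hat{Y}_{PPS}) = \frac{1}{n(n-1)} \left[ \sum_{i=1}^{n} \left( \frac{y_i}{p_i} \right)^2 - n \hat{Y}_{PPS}^2 \right]$$
This equation is mainly used for computational purposes.
Pros of PPS Sampling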
1) It is convenient and easy to use.
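2) It tends to give a sample that is representative of the population total, since the larger units, which contribute most to the total, have a greater chance of selection.
3) It makes use of auxiliary information on size at the selection stage, which can reduce the sampling variance when the characteristic under study is roughly proportional to size.
4) With replacement, the draws are independent, so the estimator and its variance estimator have simple forms.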
Cons of PPS Sampling
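1) It requires a size measure for every unit in the population, and it does not work well if that size measure is not closely related to the characteristic under study.
2) It is time-consuming and tedious when the population is large, because the successive cumulative totals have to be written down.
3) It is not as simple as simple random sampling.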
Analysis
R programming
Farm loans analysis in the US for the year 2007 using PPSWR
Aim:
To draw a sample of size 8 using the ppswr sampling technique and collect the required information, and to estimate the relative efficiency of the ppswr estimator of the total amount of real estate farm loans during 2007, using information on the nonreal estate farm loans during 2007, with respect to the ratio estimator of the population total.
DATA DESCRIPTION
The data describe the nonreal estate farm loans (x) and real estate farm loans (y) of 50 states and territories. The variables of interest are x and y, where x is the independent variable (it carries the auxiliary information) and y is the dependent variable.
library(readxl)
# Read the loan data (path as on the author's machine)
loan <- read_excel("C:/Users/KRUPA JINSA JIGY/OneDrive/Desktop/Christ University/loan.xlsx")
View(loan)
head(loan)
## # A tibble: 6 x 4
##   Sl.No `State and Territory` `Nonreal estate farm loans` `Real estate farm loa~
##   <dbl> <chr>                                       <dbl>                  <dbl>
## 1     1 AL                                         348.                   409.
## 2     2 AK                                           3.43                   2.60
## 3     3 AZ                                         431.                    54.6
## 4     4 AR                                         848.                   908.
## 5     5 CA                                        3929.                  1343.
## 6     6 CO                                         906.                   316.
library(samplingbook)
## Loading required package: pps
## Loading required package: sampling
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survival'
## The following objects are masked from 'package:sampling':
##
##     cluster, strata
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
##     dotchart
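set.seed(123)
# ppswr(sizes, n) draws n unit numbers with probability proportional to `sizes`.
# As written, the size vector is Sl.No, so the draw below is proportional to the
# serial number. For selection proportional to x, pass
# loan$`Nonreal estate farm loans` instead (this would give a different sample).
sample=ppswr(loan$Sl.No,8)
sample
## [1] 27 45 32 47 49 11 37 48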
pps=loan[sample, ]
pps
## # A tibble: 8 x 4
##   Sl.No `State and Territory` `Nonreal estate farm loans` `Real estate farm loa~
##   <dbl> <chr>                                       <dbl>                  <dbl>
## 1    27 NE                                        3585.                  1338.
## 2    45 VT                                          19.4                   57.7
## 3    32 NY                                         426.                   202.
## 4    47 WA                                        1229.                  1101.
## 5    49 WI                                        1372.                  1230.
## 6    11 HI                                          38.1                   40.8
## 7    37 OR                                         571.                   115.
## 8    48 WV                                          29.3                   99.3
# Population total of the size measure x
X=sum(loan$`Nonreal estate farm loans`)
X
## [1] 43908.12
n=8
N=50
# To estimate the population mean: (1/(n*N)) * sum(y_i/p_i), with p_i = x_i/X
y_bar=(X/(n*N))*sum(pps$`Real estate farm loans`/pps$`Nonreal estate farm loans`)
y_bar
## [1] 1128.655
# To estimate the population total: (1/n) * sum(y_i/p_i)
y_hat=(1/n)*X*sum(pps$`Real estate farm loans`/pps$`Nonreal estate farm loans`)
y_hat
## [1] 56432.76
# To find the estimated variance of the estimate of the population total:
# (1/(n*(n-1))) * (sum(z_i^2) - n*y_hat^2), where z_i = y_i/p_i
z=pps$`Real estate farm loans`*X/pps$`Nonreal estate farm loans`
vt=(1/(n*(n-1)))*(sum(z^2)-(n*y_hat^2))
vt
se=sqrt(vt)
se
# To find the estimated variance of the estimate of the population mean
vm=(1/N^2)*vt
vm
se=sqrt(vm)
se
# To obtain the gain in efficiency of ppswr over the ratio estimate of the population total
s2_y=var(loan$`Real estate farm loans`)
s2_y
## [1] 342021.5
s2_x=var(loan$`Nonreal estate farm loans`)
s2_x
## [1] 1176526
rh=sum(loan$`Real estate farm loans`)/sum(loan$`Nonreal estate farm loans`)
c=cor(loan$`Nonreal estate farm loans`,loan$`Real estate farm loans`)
c
## [1] 0.8038341
# Approximate variance of the ratio estimator of the total:
# N^2 * ((N-n)/(N*n)) * (Sy^2 + R^2*Sx^2 - 2*R*rho*Sx*Sy)
v_rat=N^2*((N-n)/(N*n))*(s2_y+rh^2*s2_x-2*rh*c*sqrt(s2_x*s2_y))
v_rat
# Percentage gain in efficiency of ppswr over the ratio estimator
eff=((v_rat-vt)/vt)*100
eff
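The estimation steps can also be collected into one small function. The sketch below is an illustration only: the function name `ppswr_total` is an assumption introduced here, and the sampled values are typed in from the printed (rounded) sample table, so the results differ slightly from those obtained from the full data.

```r
# Hypothetical helper: estimate a population total under PPS with replacement
# y: study values of the sampled units; x: their sizes; X: population total of x
ppswr_total <- function(y, x, X) {
  n <- length(y)
  z <- y * X / x                                  # z_i = y_i / p_i with p_i = x_i / X
  total <- mean(z)                                # estimator of the population total
  v_total <- sum((z - total)^2) / (n * (n - 1))   # variance estimator
  list(total = total, var = v_total, se = sqrt(v_total))
}

# Sampled values as printed (rounded) in the sample table
x_s <- c(3585, 19.4, 426, 1229, 1372, 38.1, 571, 29.3)   # nonreal estate farm loans
y_s <- c(1338, 57.7, 202, 1101, 1230, 40.8, 115, 99.3)   # real estate farm loans
est <- ppswr_total(y_s, x_s, X = 43908.12)
est$total

# Computational form of the variance estimator, for comparison
z <- y_s * 43908.12 / x_s
n_s <- length(z)
v_alt <- (sum(z^2) - n_s * est$total^2) / (n_s * (n_s - 1))
all.equal(est$var, v_alt)
```

The two forms of the variance estimator are algebraically identical; the deviation form is less prone to rounding error, while the computational form is convenient for hand calculation.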
ANALYSIS
The positions of the units drawn from the population using ppswr are 27, 45, 32, 47, 49, 11, 37 and 48. For this sample, the R computations give the following estimates of the real estate farm loans, using the information on the nonreal estate farm loans:
Estimated population total: 56432.76
Estimated population mean (estimated total divided by N = 50): 1128.655
The estimated variance of the total is obtained from the squared deviations of the values y_i/p_i about the estimated total, divided by n(n-1). Its square root is the standard error of the total, and the standard error of the mean is 1/N times that of the total. The gain in efficiency of ppswr over the ratio estimator is the percentage by which the variance of the ratio estimator exceeds that of the ppswr estimator.
INTERPRETATION
From the above analysis, the total amount of real estate farm loans in the 50 states and territories of the US in the year 2007 is estimated at 56432.76 (the value of y_hat in the R output, in the units of the data), using the information on nonreal estate farm loans with varying probabilities and a sample of size 8. The variance of this estimate describes how much the estimated total would vary from one sample of size 8 to another; it does not measure how far the real estate farm loans differ from the nonreal estate farm loans. A positive gain in efficiency means that the ppswr estimator has a smaller variance than the ratio estimator, and a negative gain means the opposite. The comparison concerns the precision of the two estimation methods only, so it does not show how well farm loans were repaid in the US in 2007.
CONCLUSION
Thus, the ppswr method can give useful estimates of population quantities even from a small sample, and it is most efficient when the characteristic under study is roughly proportional to the size measure. Whether it is more efficient than ratio estimation for a given population is judged by comparing the variances of the two estimators. Therefore, when the units differ in size and the probability of drawing a unit is allowed to vary accordingly, this method provides an effective analysis for the study.