Proportional Sampling 

                                                                             


                                                                                        ~Sanchaita Bhattacharjee-2148150.

                                                                               Christ University, Bengaluru.


In this paper I have tried to explain the concept as simply as possible, so that it is easy to understand even for readers without a statistics background.

I first give the definition, then discuss some differences between proportionate and disproportionate sampling with examples, and finally conclude the paper with R code.


DEFINITION:   Proportional sampling is a method of sampling in which the investigator divides a finite population into subpopulations and then applies random sampling techniques to each subpopulation. Proportional sampling is similar to proportional allocation in finite population sampling, although in other contexts the term also refers to other survey sampling situations. For a finite population with population size $N$, the population is divided into $H$ strata (subpopulations) according to certain attributes. The size of the $h$th stratum is denoted as $N_h$, and $\sum_{h=1}^{H} N_h = N$. Proportional sampling refers to a design with total sample size $n$ such that

$$n_h = n \cdot \frac{N_h}{N}$$

and

$$\sum_{h=1}^{H} n_h = n.$$

Following that, a simple random sample with sample size $n_h$ would be selected within each stratum. This is essentially the same as the stratified random sampling design with proportional allocation.
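In practice the stratum sample sizes computed from the stratum sizes and the total sample size are often not whole numbers. The following is a minimal R sketch of proportional allocation that rounds the sizes so they still add up to the total sample size. The function name `proportional_allocation`, the largest-remainder rounding rule, and the stratum sizes are my own assumptions for illustration, not part of the definition above.

```r
# Proportional allocation: n_h = n * N_h / N, rounded so the n_h still sum to n.
# The largest-remainder rounding rule is an assumption of this sketch.
proportional_allocation <- function(N_h, n) {
  stopifnot(all(N_h >= 0), n <= sum(N_h))
  exact <- n * N_h / sum(N_h)      # ideal, possibly fractional, sizes
  n_h <- floor(exact)              # start from the integer parts
  leftover <- n - sum(n_h)         # units still to be assigned
  if (leftover > 0) {
    # give one extra unit to the strata with the largest fractional parts
    extra <- order(exact - n_h, decreasing = TRUE)[seq_len(leftover)]
    n_h[extra] <- n_h[extra] + 1
  }
  n_h
}

# Hypothetical stratum sizes, chosen only for illustration
N_h <- c(A = 530, B = 310, C = 160)
n_h <- proportional_allocation(N_h, n = 25)
n_h          # A = 13, B = 8, C = 4
n_h / N_h    # sampling fractions, all close to n / N = 0.025
```

The exact sizes here are 13.25, 7.75 and 4, so the one leftover unit goes to stratum B, which has the largest fractional part.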


 Disproportional vs. Proportional Sampling


The main difference between the two sampling techniques is the sampling fraction applied to each stratum relative to the other strata. In proportional sampling, every stratum has the same sampling fraction, while in disproportional sampling the sampling fraction varies from stratum to stratum.


       Example of Disproportional Sample


Suppose, for example, a researcher wants to survey the students of a university with 10,000 students: 8,000 females and 2,000 males. The desired sample size is only 1,000. Since the 1,000 subjects needed for the survey are 10% of the entire population, proportional sampling suggests that 8/10 of the sample be female and 2/10 be male. This would result in a sample composed of 800 females and 200 males. In this case, the relatively small number of males in the sample would probably not provide adequate representation for drawing conclusions from the survey.

The disproportional sampling technique lets the researcher in this case select a group of adequate size from each of the two genders. For example, 500 males and 500 females can be selected to represent the population. Such a sample cannot be treated as an equal-probability sample, since each male had a better chance of being selected (500/2,000 = 25%) than each female (500/8,000 = 6.25%).
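The difference between the two designs in the university example can be checked with a few lines of R. This is only a sketch: the stratum sample means `ybar_h` are hypothetical numbers I made up to show why an unweighted average from the disproportional sample can mislead, and the weighting step is a standard remedy rather than something prescribed above.

```r
# University example: stratum sizes and two candidate designs
N_h       <- c(female = 8000, male = 2000)
n_prop    <- c(female = 800,  male = 200)   # proportional design
n_disprop <- c(female = 500,  male = 500)   # disproportional design

# Sampling fraction f_h = n_h / N_h in each stratum
f_prop    <- n_prop / N_h
f_disprop <- n_disprop / N_h
f_prop       # 0.1 in both strata
f_disprop    # female 0.0625, male 0.25

# Design weights w_h = N_h / n_h compensate for the unequal selection chances
w_disprop <- N_h / n_disprop

# Hypothetical stratum sample means (assumed values, for illustration only)
ybar_h <- c(female = 0.60, male = 0.40)
unweighted <- sum(n_disprop * ybar_h) / sum(n_disprop)
weighted   <- sum(n_disprop * w_disprop * ybar_h) / sum(n_disprop * w_disprop)
c(unweighted = unweighted, weighted = weighted)   # 0.50 versus 0.56
```

With proportional sampling the weights are the same in every stratum, so the plain sample mean needs no such correction.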

Examples for Proportionate Sampling

Proportionate sampling is a sampling strategy (a method for gathering participants for a study) used when the population is composed of several subgroups that are vastly different in number. The number of participants drawn from each subgroup is determined by the subgroup's size relative to the entire population.


For example, imagine you want to create a council of 20 employees that will meet and recommend possible changes to the employee handbook. Let's say 40% of your employees are in Sales and Marketing, 30% in Customer Service, 20% of your employees are in IT, and 10% in Finance. You will randomly select 8 people from Sales and Marketing, 6 from Customer Service, 4 from IT, and 2 from Finance. As you can see, each number you pick is proportionate to the overall percentage of people in each category (e.g., 40% = 8 people).
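The council example can be sketched in R as follows. The staff list of 200 employees, the column names `id` and `dept`, and the seed are assumptions made only so that the code runs; the department shares and the council size of 20 come from the example.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical staff list of 200 employees matching the stated shares
dept_share <- c("Sales and Marketing" = 0.40, "Customer Service" = 0.30,
                "IT" = 0.20, "Finance" = 0.10)
staff <- data.frame(
  id   = 1:200,
  dept = rep(names(dept_share), times = round(200 * dept_share))
)

council_size <- 20
n_h <- round(council_size * dept_share)   # 8, 6, 4, 2

# Simple random sample without replacement inside each department
council <- do.call(rbind, lapply(names(n_h), function(d) {
  members <- staff[staff$dept == d, ]
  members[sample(nrow(members), n_h[[d]]), ]
}))

table(council$dept)[names(n_h)]   # 8, 6, 4 and 2 members respectively
```

Each department is sampled separately, so the council always has exactly the proportional number of members from every department, whichever individuals are drawn.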

               Estimation of population proportion:

 The population proportion in the case of a qualitative characteristic can be estimated in a similar way to the population mean in the case of a quantitative characteristic. Consider a qualitative characteristic based on which the population can be divided into two mutually exclusive classes, say C and C*. For example, if C is the part of the population saying 'yes' or 'agreeing' with a proposal, then C* is the part saying 'no' or 'disagreeing' with it. Let A be the number of units in C and (N-A) the number of units in C*, in a population of size N. Then the proportion of units in C is

$$P = \frac{A}{N}.$$

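An indicator variable Y can be associated with the characteristic under study, so that for i = 1,2,…,N

$$Y_i = \begin{cases} 1 & \text{if the } i\text{th unit belongs to } C, \\ 0 & \text{if the } i\text{th unit belongs to } C^*. \end{cases}$$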

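Suppose a sample of size n is drawn from a population of size N by simple random sampling. Let a be the number of units in the sample which fall into class C and (n - a) the number which fall into class C*. Then the sample proportion of units in C is

$$p = \frac{a}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y},$$

where $y_i$ is the value of the indicator for the $i$th sampled unit. In the same way, the population total is $\sum_{i=1}^{N} Y_i = A$ and the population mean is $\bar{Y} = A/N = P$. Since $\sum_{i=1}^{N} Y_i^2 = A = NP$ and $\sum_{i=1}^{n} y_i^2 = a = np$, the population and sample variances become

$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{N}{N-1}\,PQ, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{n}{n-1}\,pq,$$

where $Q = 1 - P$ and $q = 1 - p$.

Note that the quantities $\bar{y}$, $\bar{Y}$, $s^2$ and $S^2$ have been expressed as functions of the sample and population proportions. Since the sample has been drawn by simple random sampling and the sample proportion is the same as the sample mean, the properties of the sample proportion in SRSWOR and SRSWR can be derived directly from the properties of the sample mean.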
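The link between proportions, means and variances of a 0/1 variable can be checked numerically. The sketch below uses a simulated population; the sizes `N = 500` and `n = 50`, the success probability 0.3 and the seed are arbitrary assumptions. The last lines apply the usual SRSWOR variance estimator of the sample mean to the 0/1 data, which is one way to use the "properties of the sample mean" mentioned above.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Hypothetical 0/1 population: Y_i = 1 if unit i belongs to class C
N <- 500
Y <- rbinom(N, size = 1, prob = 0.3)
A <- sum(Y)          # number of units in C
P <- A / N           # population proportion
Q <- 1 - P

all.equal(mean(Y), P)                      # population mean equals P
all.equal(var(Y), N / (N - 1) * P * Q)     # S^2 = N/(N-1) * P * Q

# Simple random sample without replacement (SRSWOR)
n <- 50
y <- sample(Y, n, replace = FALSE)
p <- mean(y)         # sample proportion = sample mean
q <- 1 - p
all.equal(var(y), n / (n - 1) * p * q)     # s^2 = n/(n-1) * p * q

# Estimated variance of p under SRSWOR, written two equivalent ways
var_p_hat  <- (N - n) / (N * (n - 1)) * p * q
var_p_mean <- (1 - n / N) * var(y) / n
all.equal(var_p_hat, var_p_mean)
```

All four `all.equal()` calls should return `TRUE`, whatever 0/1 population is generated.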


        Proportional sampling using R 

#DATA DESCRIPTION: A dataset of 2017 songs with attributes from Spotify’s API. Each song is labeled “1” if the song is liked and “0” if it is not liked. This data is used to see what kind of songs are liked or disliked. In this study the variable of interest is mode. The population used in this study has size 199.

#OBJECTIVE: The objective of this study is to estimate the proportion of songs on Spotify that are liked. I have also found the 95% confidence limits, using simple random sampling without replacement.

library(readxl)
proportion_data <- read_excel("~/proportion data.xlsx")

## New names:
## * `` -> ...1

View(proportion_data)

population=proportion_data
population

## # A tibble: 199 x 12
##     ...1 acousticness danceability duration_ms energy instrumentalness   key
##    <dbl>        <dbl>        <dbl>       <dbl>  <dbl>            <dbl> <dbl>
##  1     0      0.0102         0.833      204600  0.434       0.0219         2
##  2     1      0.199          0.743      326933  0.359       0.00611        1
##  3     2      0.0344         0.838      185707  0.412       0.000234       2
##  4     3      0.604          0.494      199413  0.338       0.51           5
##  5     4      0.18           0.678      392893  0.561       0.512          5
##  6     5      0.00479        0.804      251333  0.56        0              8
##  7     6      0.0145         0.739      241400  0.472       0.00000727     1
##  8     7      0.0202         0.266      349667  0.348       0.664         10
##  9     8      0.0481         0.603      202853  0.944       0             11
## 10     9      0.00208        0.836      226840  0.603       0              7
## # ... with 189 more rows, and 5 more variables: liveness <dbl>, loudness <dbl>,
## #   mode <dbl>, song_title <chr>, artist <chr>

N=length(population$mode)
N

## [1] 199

set.seed(123)
# draw a simple random sample of 20 songs without replacement (SRSWOR)
srswor=population[sample(1:nrow(population),20,replace = FALSE),]
srswor

## # A tibble: 20 x 12
##     ...1 acousticness danceability duration_ms energy instrumentalness   key
##    <dbl>        <dbl>        <dbl>       <dbl>  <dbl>            <dbl> <dbl>
##  1   158      0.19           0.735      275840  0.41        0             11
##  2   178      0.0108         0.787      273853  0.658       0.0152         8
##  3    13      0.366          0.762      243270  0.476       0              0
##  4   194      0.00818        0.539      224907  0.582       0.0717         7
##  5   169      0.228          0.628      260947  0.568       0.0183        10
##  6    49      0.183          0.716      576888  0.957       0.058          3
##  7   117      0.00257        0.896      267024  0.623       0.000258       2
##  8    42      0.00156        0.624      183000  0.792       0.00000154    10
##  9   196      0.00275        0.476      195200  0.818       0.757          2
## 10   192      0.0346         0.19       230693  0.908       0.0369         0
## 11   152      0.689          0.658      238560  0.179       0              8
## 12    89      0.0397         0.786      302840  0.547       0.272          1
## 13    90      0.0213         0.734      321933  0.548       0             11
## 14   186      0.0585         0.552      283636  0.878       0.892          1
## 15   184      0.0414         0.774      269867  0.901       0.00000484     9
## 16    91      0.0454         0.812      268421  0.676       0              1
## 17   136      0.103          0.772      191160  0.582       0.0000661      4
## 18    98      0.716          0.685      427227  0.423       0.518         11
## 19    71      0.783          0.596      196627  0.639       0.569          0
## 20    25      0.00219        0.781      205160  0.795       0.269          7
## # ... with 5 more variables: liveness <dbl>, loudness <dbl>, mode <dbl>,
## #   song_title <chr>, artist <chr>

#Suppose we are interested in estimating the proportion of songs on Spotify that are liked (0=song disliked, 1=song liked).

song1=sum(srswor$mode==1)
song1

## [1] 14

library(samplingbook)

# y alone is sufficient: Sprop counts the 1s itself and uses n = length(y) by default
Sprop(y=srswor$mode,N=N,level = .95)

##
## Sprop object: Sample proportion estimate
## With finite population correction: N = 199
##
## Proportion estimate:  0.7
## Standard error:  0.0997
##
## 95% approximate confidence interval:
##  proportion: [0.5046,0.8954]
##  number in population: [101,178]
## 95% exact hypergeometric confidence interval:
##  proportion: [0.4673,0.8744]
##  number in population: [93,174]

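# estimated number of liked songs in the population: N * proportion estimate
pop_total=199*0.7
pop_total

## [1] 139.3

# upper 95% confidence limit: estimate + 1.96 * standard error
ul=0.7+1.96*0.0997
ul

## [1] 0.895412

# lower 95% confidence limit: estimate - 1.96 * standard error
ll=0.7-1.96*0.0997
ll

## [1] 0.504588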
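The standard error and the approximate interval reported by Sprop can also be reproduced by hand. The sketch below assumes the usual SRSWOR estimator with finite population correction, sqrt((1 - n/N) * p(1 - p)/(n - 1)); this formula is my assumption about what Sprop computes, although it does agree with the printed output to four decimals.

```r
# Values taken from the sample above
N <- 199   # population size
n <- 20    # sample size
m <- 14    # liked songs in the sample

p_hat <- m / n                                   # proportion estimate, 0.7

# standard error with finite population correction (assumed formula)
se <- sqrt((1 - n / N) * p_hat * (1 - p_hat) / (n - 1))

# approximate 95% confidence interval based on the normal distribution
z  <- qnorm(0.975)
ci <- p_hat + c(-1, 1) * z * se

round(se, 4)    # 0.0997
round(ci, 4)    # 0.5046 0.8954
N * p_hat       # 139.3, estimated number of liked songs
```

Writing the calculation this way avoids retyping rounded values such as 0.0997 and makes it easy to repeat for another sample.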

#INTERPRETATION: The data has population size N=199, and using SRSWOR I have taken a sample of size 20. Since we are interested in the proportion of songs liked, we note that in the sample of size 20 there are 14 songs that are liked and 6 songs that are disliked. To estimate the proportion I have used the function Sprop, and the proportion estimate is 0.7. This means that an estimated 70% of the songs in the population are liked and 30% are disliked. The approximate 95% confidence interval for the proportion is [0.5046,0.8954], and the corresponding interval for the number of liked songs in the population is [101,178], which means we can be approximately 95% confident that between 101 and 178 songs are liked. The estimated population total of liked songs is 139.3, that is, about 139 songs.





















