Name: JOANNA JONES

Registration No.: 2148134

DEPARTMENT OF STATISTICS

CHRIST (DEEMED TO BE UNIVERSITY), BENGALURU

SAMPLING FOR PROPORTIONS UNDER SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT

In many situations, the characteristic under study is qualitative or categorical in nature. For example, the items coming off an assembly line are either defective or not; or, in marketing surveys, customers are asked to rank their preferences as first preference, second preference, etc. In these scenarios, the questions that arise are how to carry out the sampling and how to estimate population parameters such as the population mean and the population total.

In this blog, we answer these questions and work through an example in R.

Procedure for Sampling

The procedure for drawing a sample is the same whether the characteristic under study is qualitative or quantitative. For instance, Simple Random Sampling Without Replacement (SRSWOR) and Simple Random Sampling With Replacement (SRSWR) are carried out in exactly the same way in both cases. The same holds for other sampling methods such as stratified sampling and systematic sampling.
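
As a minimal sketch of the two schemes in base R, the only difference is the `replace` argument of `sample()`. The population below (40 items, 12 of them defective) and the sample size are made-up values for illustration, not data from this blog.

```r
# Hypothetical population: 1 = defective (class C), 0 = not defective (class C*)
population <- c(rep(1, 12), rep(0, 28))   # N = 40, A = 12 (made-up values)
n <- 10                                   # illustrative sample size

set.seed(1)  # only to make the sketch reproducible

# SRSWOR: every unit can enter the sample at most once
idx_wor <- sample(seq_along(population), size = n, replace = FALSE)
sample_wor <- population[idx_wor]

# SRSWR: the same unit may be drawn more than once
idx_wr <- sample(seq_along(population), size = n, replace = TRUE)
sample_wr <- population[idx_wr]
```

Sampling the unit indices rather than the 0/1 values makes it possible to see which units were selected, which is what distinguishes the two schemes.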

Estimation of Population Proportion:

In the population, the units are classified into two categories:

Possessing a particular characteristic
Not possessing that characteristic

We shall consider the example of items coming off an assembly line being defective or not defective. If an item is defective, we say that it possesses the characteristic 'defective' and it belongs to the class denoted by C. If it is not defective, it does not possess the characteristic and it belongs to the class denoted by C*. Thus, the population is divided into two mutually exclusive classes, C and C*.

Let A be the number of units in C and (N - A) the number of units in C*, in a population of size N. Then, the proportion of units in C is

$$P = \frac{A}{N}$$

and the proportion of units in C* is

$$Q = \frac{N - A}{N} = 1 - P.$$

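We can associate the characteristic under study with an indicator variable Y_i, i = 1, 2, ..., N, which takes the value 1 if the i^th unit belongs to C and 0 if the i^th unit belongs to C*. In other words,

$$Y_i = \begin{cases} 1, & \text{if the } i\text{th unit belongs to } C \\ 0, & \text{if the } i\text{th unit belongs to } C^{*} \end{cases}$$

The population total is

$$\sum_{i=1}^{N} Y_i = A$$

and the population mean is

$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i = \frac{A}{N} = P.$$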

Hence, the question of estimating the population proportion becomes that of estimating the population mean by defining the variable Y_i as above.
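
The indicator coding can be sketched in base R as follows. The inspection records are hypothetical and serve only to show that the total of the 0/1 variable is the count A and its mean is the proportion P.

```r
# Hypothetical inspection records for a population of N = 8 items
status <- c("defective", "ok", "ok", "defective", "ok", "ok", "ok", "defective")

# Indicator variable: Y_i = 1 if unit i is in class C (defective), else 0
Y <- as.integer(status == "defective")

N <- length(Y)
A <- sum(Y)     # population total = number of units in C
P <- mean(Y)    # population mean  = A / N = population proportion
Q <- 1 - P      # proportion of units in C*

c(N = N, A = A, P = P, Q = Q)
```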

Estimation of Sample Proportion:

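Now, suppose a sample of size n is drawn from a population of size N by simple random sampling (SRS).

Let a be the number of units in the sample that fall into class C, so that (n - a) units fall into class C*. Then the sample proportion of units in C is

$$p = \frac{a}{n}$$

which can also be written as

$$p = \frac{a}{n} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y},$$

where y_i is the value of the indicator variable for the i^th sampled unit.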

Estimation of Variance:

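Since Y_i takes only the values 0 and 1, we have Y_i² = Y_i, so

$$\sum_{i=1}^{N} Y_i^2 = \sum_{i=1}^{N} Y_i = A = NP,$$

thus, we can write S² in terms of P and Q as follows:

$$S^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2 = \frac{1}{N-1} \left( \sum_{i=1}^{N} Y_i^2 - N \bar{Y}^2 \right) = \frac{1}{N-1} \left( NP - NP^2 \right) = \frac{N}{N-1} PQ.$$

Similarly, as

$$\sum_{i=1}^{n} y_i^2 = a = np,$$

we can write s² in terms of p and q = 1 - p:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n-1} \left( np - np^2 \right) = \frac{n}{n-1} pq.$$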

NOTE:

The quantities y̅ and s² are used to express functions of the sample proportion, and Y̅ and S² are used to express functions of the population proportion.

Because the sample is drawn by simple random sampling and the sample proportion is the same as the sample mean, the properties of the sample proportion under SRSWOR and SRSWR follow directly from the properties of the sample mean.
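
For a 0/1 variable, the usual variances reduce to S² = N P Q / (N - 1) in the population and s² = n p q / (n - 1) in the sample. A small numerical check in base R, using a made-up population of 40 items with 12 defectives (an assumption for illustration only):

```r
# Hypothetical 0/1 population: N = 40, A = 12
Y <- c(rep(1, 12), rep(0, 28))
N <- length(Y); P <- mean(Y); Q <- 1 - P

S2_direct  <- var(Y)                 # var() divides by N - 1
S2_formula <- N / (N - 1) * P * Q

# The same identity for an SRSWOR sample of size 10
set.seed(42)
y <- sample(Y, size = 10, replace = FALSE)
n <- length(y); p <- mean(y); q <- 1 - p

s2_direct  <- var(y)                 # divides by n - 1
s2_formula <- n / (n - 1) * p * q

all.equal(S2_direct, S2_formula)
all.equal(s2_direct, s2_formula)
```

Both comparisons hold for any 0/1 data, because squaring an indicator leaves it unchanged.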

SRSWOR

Mean:

In the case of SRSWOR, the sample mean y̅ is an unbiased estimator of the population mean Y̅, i.e.

$$E(\bar{y}) = \bar{Y},$$

hence

$$E(p) = E(\bar{y}) = \bar{Y} = P,$$

and thus, p is an unbiased estimator of P.

Variance:

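Now, we shall use the expression for Var(y̅) to derive the variance of p:

$$\mathrm{Var}(p) = \mathrm{Var}(\bar{y}) = \frac{N-n}{Nn} S^2 = \frac{N-n}{Nn} \cdot \frac{N}{N-1} PQ = \frac{N-n}{N-1} \cdot \frac{PQ}{n}.$$

Similarly, we shall use the estimate of Var(y̅) to derive the estimate of the variance of p:

$$\widehat{\mathrm{Var}}(p) = \frac{N-n}{Nn} s^2 = \frac{N-n}{Nn} \cdot \frac{n}{n-1} pq = \frac{N-n}{N(n-1)} pq.$$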
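
These SRSWOR properties can be checked exactly on a tiny population by listing every possible sample. The sketch below assumes the standard results Var(p) = (N - n) P Q / ((N - 1) n) and estimated variance (N - n) p q / (N (n - 1)); the population of 8 units is hypothetical.

```r
# Hypothetical 0/1 population: N = 8, A = 3
Y <- c(1, 0, 0, 1, 0, 0, 0, 1)
N <- length(Y); n <- 3
P <- mean(Y); Q <- 1 - P

# All choose(8, 3) = 56 SRSWOR samples; each column holds one set of unit indices
samples <- combn(N, n)
p_all <- apply(samples, 2, function(idx) mean(Y[idx]))

# Every sample is equally likely under SRSWOR, so expectations are plain averages
E_p         <- mean(p_all)                         # should equal P
Var_p       <- mean((p_all - P)^2)                 # exact variance of p
Var_formula <- (N - n) / (N - 1) * P * Q / n

# Expected value of the variance estimator over all samples
v_hat_all <- (N - n) / (N * (n - 1)) * p_all * (1 - p_all)
E_v_hat   <- mean(v_hat_all)

c(E_p = E_p, P = P, Var_p = Var_p, Var_formula = Var_formula, E_v_hat = E_v_hat)
```

Full enumeration is only feasible for very small N and n; for realistic sizes a simulation with repeated `sample()` calls would give an approximate version of the same check.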

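Estimation of Population Total (Total Count):

The estimate of the population total A (i.e. the total count of units in C) is

$$\hat{A} = Np,$$

its variance is

$$\mathrm{Var}(\hat{A}) = N^2 \, \mathrm{Var}(p) = \frac{N^2 (N-n)}{N-1} \cdot \frac{PQ}{n},$$

and the estimate of its variance is

$$\widehat{\mathrm{Var}}(\hat{A}) = N^2 \, \widehat{\mathrm{Var}}(p) = \frac{N(N-n)}{n-1} pq.$$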
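
A short base R sketch of the total estimate and its estimated standard error under SRSWOR, assuming the estimator N p and the variance estimate N (N - n) p q / (n - 1). The values of N, n and a are hypothetical.

```r
# Hypothetical survey: population of 500 items, sample of 50, 12 defectives found
N <- 500; n <- 50; a <- 12

p <- a / n
q <- 1 - p

A_hat     <- N * p                             # estimated number of units in C
var_A_hat <- N * (N - n) / (n - 1) * p * q     # estimated variance of A_hat
se_A_hat  <- sqrt(var_A_hat)

c(A_hat = A_hat, se_A_hat = se_A_hat)
```

The variance of the total is simply N² times the variance of the proportion, so the standard error scales by N.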

Confidence Interval Estimation of P:

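If N and n are large, then

$$\frac{p - P}{\sqrt{\mathrm{Var}(p)}}$$

approximately follows N(0,1). With this approximation, we can write

$$P\left( -Z_{\alpha/2} \le \frac{p - P}{\sqrt{\mathrm{Var}(p)}} \le Z_{\alpha/2} \right) = 1 - \alpha,$$

where Z_{α/2} is the upper α/2 point of the standard normal distribution, and the 100(1 - α)% confidence interval of P is

$$\left( p - Z_{\alpha/2} \sqrt{\mathrm{Var}(p)}, \;\; p + Z_{\alpha/2} \sqrt{\mathrm{Var}(p)} \right).$$

In practice, Var(p) is unknown and is replaced by its estimate.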
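
The interval can be wrapped in a small helper. This is a sketch in base R, not the implementation of any package: the function name `prop_ci_srswor` and its arguments are hypothetical, and it assumes the SRSWOR variance estimate (N - n) p q / (N (n - 1)) together with the normal approximation.

```r
# Hypothetical helper: normal-approximation CI for a proportion under SRSWOR
#   a = number of sampled units in class C, n = sample size, N = population size
prop_ci_srswor <- function(a, n, N, level = 0.95) {
  p  <- a / n
  se <- sqrt((N - n) / (N * (n - 1)) * p * (1 - p))   # estimated standard error
  z  <- qnorm(1 - (1 - level) / 2)                    # upper alpha/2 normal point
  c(estimate = p, se = se, lower = p - z * se, upper = p + z * se)
}

# Usage with made-up numbers: 12 defectives in a sample of 50 from 500 items
res <- prop_ci_srswor(a = 12, n = 50, N = 500)
round(res, 4)
```

Multiplying the `lower` and `upper` entries by N gives the corresponding interval for the population total.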

Applications of Sampling for Proportions:

Television ratings are determined by estimating the proportion of the public who watch a certain program.
Estimating the proportion of customers giving a particular response in marketing surveys based on replies such as ‘yes’ or ‘no’, ‘agree’ or ‘disagree’, etc.
Estimating the population proportion or the percentage of brown-eyed people, unemployed people, those who have graduated or people favoring a proposal, etc.
Proportion of all items coming off an assembly line that are defective.
Proportion of all people entering a retail store who purchase before leaving.

Thus, there is a wide range of applications pertaining to sampling for proportions.

Explanation of Sampling for Proportions Using R

The data set used is the Prostate Cancer data set. Prostate cancer begins when cells in the prostate gland start to grow out of control. The prostate gland, which is found only in males, produces seminal fluid. The data set was taken from Kaggle and is available at https://www.kaggle.com/sajidsaifi/prostate-cancer.

We see that there are 100 observations of 10 variables. Since we need to estimate a proportion, we analyze the variable diagnosis_result, which records whether the diagnosis is malignant (the tumor can grow and spread to other parts of the body), denoted by 1, or benign (the tumor can grow but does not spread), denoted by 0.

Summary of the Report

The Prostate Cancer dataset was utilized to carry out the analysis. Prostate cancer begins when cells in the prostate gland start to grow out of control.

The population size was 100 and the sample size was 20. Using the Sprop() function, we estimated from the sample that 65% of the patients in the population were diagnosed as malignant. The standard error was 0.0979, which means that the proportion estimates from different samples of size 20 drawn from this population of size 100 would typically deviate from the population proportion by about 0.0979.
The estimate of the population total is 65 patients diagnosed as malignant.

The confidence intervals are calculated using the normal approximation to the sampling distribution of the sample proportion.

The 95% confidence interval for the population proportion of patients diagnosed as malignant was [0.4582, 0.8418]. That is, we are 95% confident that the proportion of patients in the population diagnosed as malignant lies between 0.4582 and 0.8418.
The 95% confidence interval for the population total of patients diagnosed as malignant was [46, 84]. That is, we are 95% confident that the number of patients in the population diagnosed as malignant lies between 46 and 84.
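
The figures reported above can be reproduced from the summary values alone (N = 100, n = 20, estimated proportion 0.65). The base R sketch below does not call Sprop() or read the data set; it assumes the SRSWOR variance estimate (N - n) p q / (N (n - 1)) and the normal approximation, and rounding the limits for the total to whole patients is an assumption about how the reported interval was obtained.

```r
# Summary values taken from the report above
N <- 100; n <- 20; p <- 0.65

# Estimated standard error of p under SRSWOR
se_p <- sqrt((N - n) / (N * (n - 1)) * p * (1 - p))

# 95% normal-approximation interval for the proportion
z    <- qnorm(0.975)
ci_p <- p + c(-1, 1) * z * se_p

# Estimated total and its interval, scaled by N
A_hat <- N * p
ci_A  <- N * ci_p

round(se_p, 4)   # 0.0979
round(ci_p, 4)   # 0.4582 0.8418
A_hat            # 65
round(ci_A)      # 46 84
```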
