StatPac for Windows User's Guide

Overview

System Requirements and Installation

System Requirements

Installation

Unregistering & Removing the Software from a PC

Network Operation

Updating to a More Recent Version

Backing-Up a Study

Processing Time

Server Demands and Security

Technical Support

Notice of Liability

Paper & Pencil and CATI Survey Process

Internet Survey Process

Basic File Types

Codebooks (.cod)

Data Manager Forms (.frm)

Data Files (.dat)

Internet Response Files (.asc or .txt)

Email Address Lists (.lst or .txt)

Email Logs (.log)

Rich Text Files (.rtf)

HTML Files (.htm)

Perl Script (.pl)

Password Files (.text)

Exported Data Files (.txt and .csv and .mdb)

Email Body Files (.txt or .htm)

Sample File Naming Scheme for a Survey

Customizing the Package

Problem Recognition and Definition

Creating the Research Design

Methods of Research

Sampling

Data Collection

Reporting the Results

Validity

Reliability

Systematic and Random Error

Formulating Hypotheses from Research Questions

Type I and Type II Errors

Types of Data

Significance

One-Tailed and Two-Tailed Tests

Procedure for Significance Testing

Bonferroni's Theorem

Central Tendency

Variability

Standard Error of the Mean

Inferences with Small Sample Sizes

Degrees of Freedom

Components of a Study Design

Elements of a Variable

Variable Format

Variable Name

Variable Label

Value Labels

Valid Codes

Skip Codes for Branching

Data Entry Control Parameters

Missing OK

Auto Advance

Caps Only

Codebook Tools

The Grid

Codebook Libraries

Duplicating Variables

Insert & Delete Variables

Move Variables

Starting Columns

Print a Codebook

Variable Detail Window

Codebook Creation Process

Method 1 - Create a Codebook from Scratch

Method 2 - Create a Codebook from a Word-Processed Document

Spell Check a Codebook

Multiple Response Variables

Missing Data

Changing Information in a Codebook

Overview

Data Input Fields

Form Naming Conventions

Form Creation Process

Using the Codebook to Create a Form

Using a Word-Processed Document to Create a Form

Variable Text Formatting

Field Placement

Value Labels

Variable Separation

Variable Label Indent

Value Labels Indent

Space between Columns

Valid Codes

Skip Codes

Variable Numbers

Variable List and Detail Windows

Data Input Settings

Select a Specific Variable

Finding Text in the Form

Replacing Text in the Form

Saving the Codebook or Workspace

Overview

Keyboard And Mouse Functions

Create A New Data File

Edit Or Add To An Existing Data File

Select A Different Data File

Change Fields

Change Records

Enter A New Data Record

View Data For A Specified Record Number

Find Records That Contain Specified Data

Duplicate A Field From The Previous Record

Delete A Record

Data Input Settings

Compact Data File

Double Entry Verification

Print A Data Record

Variable List & Detail Windows

Data File Format

Overview

HTML Email Surveys

Plain Text Email Surveys

Brackets

Item Numbering

Codebook Design for a Plain Text Email Survey

Capturing a Respondent's Email Address

Filtering Email to a Mailbox

General Considerations for Plain Text Email

Overview

Internet Survey Process

Server Setup

Create the HTML Survey Pages

Upload the Files to the Web server

Test the survey

Download and import the test data

Delete the test data from the server

Conduct the survey

Download and import the data

Display a survey closed message

Server Setup

FTP Login Information

Paths & Folder Information

Design Considerations for Internet Surveys

Special Variables for Internet Surveys

Script to Create the HTML

Command Syntax & Help

Saving and Loading Styles

Survey Generation Procedure

Script Editor

Imbedded HTML Tags

Primary Settings

HTML Name (HTMLName=)

Banner Image(s)  (BannerImage=)

Heading  (Heading=)

Finish Text & Finish URL (FinishText= and FinishURL=)

Cookie (Cookie=)

IP Control (IPControl=)

Allow Cross Site (AllowCrossSite=)

URL to Survey Folder  (WebFolderURL=)

Advanced Settings - Header & Footer

RepeatBannerImage

RepeatHeading

PageNumbers

ContinueButtonText

SubmitButtonText

ProgressBar

FootnoteText & FootnoteURL

Advanced Settings - Finish & Popups

Thanks

Closed

HelpWindowWidth & HelpWindowHeight

HelpLinkText

LinkText

PopupBannerImage

PopupFullScreen

Advanced Settings - Control

Method

Email

RestartSeconds

MaximizeWindow

BreakFrame

AutoAdvance

BranchDelay

Cache

Index

ForceLoaderSubmit

ExtraTallBlankLine

RadioTextPosition

TextBoxTextPosition

LargeTextBoxPosition

LargeTextBoxProgressBar

Advanced Settings - Fonts & Colors

Global Attributes

Heading, Title, Text, & Footnote Attributes

Instructions, Question, and Response Attributes

Advanced Settings - Passwords - Color & Banner Image

LoginBannerImage

LoginBGColor

LoginWallpaper

LoginWindowColor

Advanced Settings - Passwords - Text & Control

PasswordType

LoginText

PasswordText

LoginButtonText

FailText

FailButtonText

ShowLink

EmailMe

KeepLog

Advanced Settings - Passwords - Single vs. Multiple

Password (single password method)

PasswordFile (multiple passwords method)

PasswordField & ID Field (multiple passwords method)

PasswordControl

Advanced Settings - Passwords - Technical Notes

Advanced Settings - Server Overrides

ActionTag

StorageFolder

ScriptFolder

Perl

MailProgram

Branching and Piping

Randomization (Rotations)

Survey Creation Script - Overview

Using Commands More than Once in a Script

Survey Creation - Specify Text

Heading

Title

Text

FootnoteText

Instructions

Question

Survey Creation - Spacing and pagination

BlankLine

NewPage

Survey Creation - Images and Links

Image

Link

Survey Creation - Help Windows

Survey Creation - Popup Windows

Survey Creation - Objects

Radio Buttons for a Single Variable

Radio Buttons for Grouped Variables (matrix style)

DropDown Menu

TextBox for a Single Variable

Adding a TextBox to a Radio Button, CheckBox, or Radio Button Matrix

TextBoxes for Grouped Variables

Sliders for Single or Grouped Variables

CheckBox for Multiple Response Variables

ListBox

Uploading and Downloading Files from the Server

Auto Transfer

FTP

Summary of the Most Common Script Commands

Overview

Format of an Email Address File

Extract Email Addresses

List Statistics

Join Two or More Lists

Split a List

Clean, Sort, and Eliminate Duplicates

Add ID Numbers to a List

Create a List of Nonresponders

Subtract One List From Another List

Merge an Email List into a StatPac Data File

Send Email Invitations

Using an ID Number to Track Responses

Email Address File

Body Text File

Sending Email

Overview

Mouse and Keyboard Functions

Designing Analyses

Continuation Lines

Comment Lines

V Numbers

Keywords

Analyses

Variable List

Variable Detail

Find Text

Replace Text

Options

Load, Save, and Merge Procedure Files

Print a Procedure File

Run a Procedure File

Results Editor

Graphics

Table of Contents

Automatically Generate Topline Procedures

Keyword Index

Keywords Overview

Categories of Keywords

Keyword Help

Ordering Keywords

Global and Temporary Keywords

Permanently Change a Codebook and Data File

Backup a Study

STUDY Command

DATA Command

SAVE Command

WRITE Command

MERGE Command

HEADING Command

TITLE Command

FOOTNOTE Command

LABELS Command

OPTIONS Command

SELECT and REJECT Commands

NEW Command

LET Command

STACK Command

RECODE Command

COMPUTE Command

AVERAGE, COUNT and SUM Commands

IF-THEN ELSE Command

SORT Command

WEIGHT Command

NORMALIZE Command

LAG Command

DIFFERENCE Command

DUMMY Command

RUN Command

REM Command

Reserved Words

Reserved Word RECORD

Reserved Word TOTAL

Reserved Word MEAN

Reserved Word TIME

Analyses Index

Analyses Overview

LIST Command

FREQUENCIES Command

CROSSTABS Command

BANNERS Command

DESCRIPTIVE Command

BREAKDOWN Command

TTEST Command

CORRELATE Command

Advanced Analyses Index

REGRESS Command

STEPWISE Command

LOGIT and PROBIT Commands

PCA Command

FACTOR Command

CLUSTER Command

DISCRIMINANT Command

ANOVA Command

CANONICAL Command

MAP Command

Advanced Analyses Bibliography

Utility Programs

Import and Export

StatPac and Prior Versions of StatPac Gold

Access and Excel

Comma Delimited and Tab Delimited Files

Files Containing Multiple Data Records per Case

Internet Files

Email Surveys

Merging Data Files

Concatenate Data Files

Merge Variables and Data

Aggregate

Codebook

Quick Codebook Creation

Check Codebook and Data

Sampling

Random Number Table

Random Digit Dialing Table

Select Random Records from Data File

Compare Data Files

Conversions

Date Conversions

Currency Conversion

Dichotomous Multiple Response Conversion

Statistics Calculator Menu

Distributions Menu

Normal distribution

T distribution

F distribution

Chi-square distribution

Counts Menu

Chi-square test

Fisher's Exact Test

Binomial Test

Poisson Distribution Events Test

Percents Menu

Choosing the Proper Test

One Sample t-Test between Percents

Two Sample t-Test between Percents

Confidence Intervals around a Percent

Means Menu

Mean and Standard Deviation of a Sample

Matched Pairs t-Test between Means

Independent Groups t-Test between Means

Confidence Interval around a Mean

Compare a Sample Mean to a Population Mean

Compare Two Standard Deviations

Compare Three or more Means

Correlation Menu

Sampling Menu

Sample Size for Percents

Sample Size for Means

Statistics Calculator

Statistics Calculator Menu

Statistics Calculator is an easy-to-use program designed to perform a series of basic statistical procedures related to distributions and probabilities. Most of the procedures are called inferential because data from a sample are used to draw inferences about a population.

The menu bar of Statistics Calculator contains six groups of operations that can be performed by the software.

 

Distributions

Counts

Percents

Means

Correlation

Sampling

 

The Distributions menu item is the electronic equivalent of probability tables.  Algorithms are included for the z, t, F, and chi-square distributions.  This selection may be used to find probabilities and critical values for the four statistics.

The Counts menu item contains routines to analyze a contingency table of counts, compute Fisher's exact probability for two-by-two tables, use the binomial distribution to predict the probability of a specified outcome, and use the Poisson distribution to test the likelihood of observing a specified number of events.

The Percents menu item is used to compare two percents.  Algorithms are included to compare proportions drawn from one or two samples.  There is also a menu option to calculate confidence intervals around a percent.

The Means menu item is used to calculate a mean and standard deviation of a sample, compare two means to each other, calculate a confidence interval around a mean, compare a sample mean to a population mean, compare two standard deviations to each other, and compare three or more means.

The Correlation menu item is used to calculate correlation and simple linear regression statistics for paired data. Algorithms are included for ordinal and interval data.    

The Sampling menu item is used to determine the required sample size for a study.  The software can be used for problems involving percents and means.

 

Distributions Menu

The Distributions menu selection is used to calculate critical values and probabilities for various distributions. The most common distributions are the z (normal) distribution, t distribution, F distribution, and the chi-square distribution. Within the last 20 years, computers have made it easy to calculate exact probabilities for the various statistics. Prior to that, researchers made extensive use of books containing probability tables.

Normal distribution

The normal distribution is the most well-known distribution and is often referred to as the z distribution or the bell-shaped curve.  It is used when the sample size is greater than 30. When the sample size is less than 30, the t distribution is used instead of the normal distribution. 

The menu offers three choices: 1) probability of a z value, 2) critical z for a given probability, and 3) probability of a defined range.

Probability of a z Value

When you have a z (standardized) value for a variable, you can determine the probability of that value.  The software is the electronic equivalent of a normal distribution probability table. When you enter a z value, the area under the normal curve between -z and +z is calculated. The remaining area, in the two tails of the distribution, is referred to as the rejection region.  It is also called a two-tailed probability because both tails of the distribution are excluded. The Statistics Calculator reports the two-tailed probability for the z value. A one-tailed probability is used when your research question is concerned with only one half of the distribution. Its value is exactly half the two-tailed probability.

Example

z-value: 1.96

-----------------------------------------

Two-tailed probability = .0500
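The same table lookup can be sketched in a few lines of Python using only the standard library's statistics.NormalDist.  StatPac's internal algorithm is not published, so treat this as an independent check rather than the product's actual code:

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Area in both tails beyond +/- z of the standard normal curve."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(round(two_tailed_p(1.96), 4))  # 0.05, matching the example above
```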

Critical z for a Given Probability

This menu selection is used to determine the critical z value for a given probability. 

Example

A large company designed a pre-employment survey to be administered to prospective employees.  Baseline data was established by administering the survey to all current employees.  They now want to use the instrument to identify job applicants who have very high or very low scores.  Management has decided they want to identify people who score in the upper and lower 3% when compared to the norm. How many standard deviations away from the mean are required to define the upper and lower 3% of the scores?

The total area of rejection is 6%.  This includes 3% who scored very high and 3% who scored very low.  Thus, the two-tailed probability is .06.  The z value required to reject 6% of the area under the curve is 1.881.  Thus, new applicants who score more than 1.881 standard deviations above or below the mean are the people to be identified.

 

Two tailed probability:  .06

---------------------------------

z-value = 1.881
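The inverse lookup can be sketched the same way; inv_cdf below is the standard library's inverse normal CDF (again, an illustration, not StatPac's own code):

```python
from statistics import NormalDist

def critical_z(two_tailed_p):
    """Critical z that leaves two_tailed_p of the area split between the two tails."""
    return NormalDist().inv_cdf(1 - two_tailed_p / 2)

print(round(critical_z(0.06), 3))  # 1.881, matching the example above
```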

Probability of a Defined Range

Knowing the mean and standard deviation of a sample allows you to establish the area under the curve for any given range. This menu selection will calculate the probability that the mean of a new sample would fall between two specified values (i.e., between the limits of a defined range).

Example

A manufacturer may find that the emission level from a device is 25.9 units with a standard deviation of 2.7. The law limits the maximum emission level to 28.0 units.  The manufacturer may want to know what percent of the new devices coming off the assembly line will need to be rejected because they exceed the legal limit.

 

Sample mean = 25.9

Unbiased standard deviation = 2.7

Lower limit of the range = 0

Upper limit of the range = 28.0

----------------------------------------------------------------

Probability of a value falling within the range = .7817

Probability of a value falling outside the range = .2183

 

The area within the range is the area below the upper limit minus the area below the lower limit.

The area under the normal curve is the probability that additional samples would fall between the lower and upper limits. In this case, the area above the upper limit is the rejection area (21.83% of the product would be rejected).
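A minimal sketch of this range calculation, using the emission example above and only the Python standard library (the negligible area below the lower limit of 0 is subtracted automatically):

```python
from statistics import NormalDist

# Emission example: mean 25.9, unbiased standard deviation 2.7, range 0 to 28.0
nd = NormalDist(mu=25.9, sigma=2.7)
p_within = nd.cdf(28.0) - nd.cdf(0.0)   # area between the two limits
print(round(p_within, 4))               # probability within the range
print(round(1 - p_within, 4))           # probability outside the range (rejected)
```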

T distribution

 

Mathematicians used to think that all distributions followed the bell-shaped curve. In the early 1900s, an English chemist named Gosset discovered that distributions were much flatter than the bell-shaped curve when working with small sample sizes.  In fact, the smaller the sample, the flatter the distribution. The t distribution is used instead of the normal distribution when the sample size is small. As the sample size approaches thirty, the t distribution approximates the normal distribution.  Thus, the t distribution is generally used instead of the z distribution, because it is correct for both large and small sample sizes, whereas the z distribution is only correct for large samples.

The menu offers three choices: 1) probability of a t value, 2) critical t value for a given probability, and 3) probability of a defined range.

Probability of a t Value

If you have a t value and the degrees of freedom associated with the value, you can use this program to calculate the two-tailed probability of t. It is the equivalent of a computerized table of t values.

Example

t-value: 2.228

df:  10

------------------------------------

Two-tailed probability = .050
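The t distribution's CDF has no simple closed form, but it can be approximated by numerically integrating the t density.  The sketch below uses Simpson's rule and only the Python standard library; it reproduces the table value above, though it is not StatPac's actual algorithm:

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed probability of t: the area beyond +/- t, by Simpson's rule."""
    t = abs(t)
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area_0_to_t = total * h / 3      # area between 0 and t
    return 1 - 2 * area_0_to_t       # symmetry: remove both central halves

print(round(two_tailed_p(2.228, 10), 3))  # 0.05, matching the example above
```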

Critical t Value for a Given Probability

This program is the opposite of the previous program. It is used if you want to know what critical t value is required to achieve a given probability.

Example

Two-tailed probability:  .050

Degrees of freedom:  10

-----------------------------------

t-value = 2.228

Probability of a Defined Range

Knowing the mean and standard deviation of a sample allows you to establish the area under the curve for any given range.  You can use this program to calculate the probability that the mean of a new sample would fall between two values.

Example

A company did a survey of 20 people who used its product.  The mean age of the sample was 22.4 years and the unbiased standard deviation was 3.1 years. The company now wants to advertise in a magazine that has a primary readership of people who are between 18 and 24, so it needs to know what percent of its potential customers are between 18 and 24 years of age.

 

Sample mean:  22.4

Unbiased standard deviation:  3.1

Sample size = 20

Lower limit of the range = 18

Upper limit of the range = 24

----------------------------------------------------------------

Probability of a value falling within the range = .608

Probability of a value falling outside the range = .392

 

Because of the small sample size, the t distribution is used instead of the z distribution.  The area under the curve represents the proportion of customers in the population expected to be between 18 and 24 years of age. In this example, we would predict that 60.8% of its customers would be between 18 and 24 years of age, and 39.2% would be outside of the range.  The company decided not to advertise.

F distribution

The F-ratio is used to compare the variances of two or more samples or populations. Since it is a ratio (i.e., a fraction), there are degrees of freedom for the numerator and the denominator. This menu selection may be used to calculate the probability of an F-ratio or to determine the critical value of F for a given probability. These menu selections are the computer equivalent of an F table.

Probability of an F-Ratio

If you have an F-ratio and the degrees of freedom associated with the numerator and denominator, you can use this program to calculate the probability.

Example

F-ratio:  2.774

Numerator degrees of freedom:  20

Denominator degrees of freedom:  10

----------------------------------------------

Two-tailed probability = .0500
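The same numerical-integration idea works for the F distribution.  The sketch below integrates the F density with Simpson's rule to find the area above the observed ratio, which is the tail area that corresponds to the table value quoted above; it is an illustration, not StatPac's actual algorithm:

```python
import math

def f_density(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    beta = math.gamma(d1 / 2) * math.gamma(d2 / 2) / math.gamma((d1 + d2) / 2)
    return ((d1 / d2) ** (d1 / 2) * x ** (d1 / 2 - 1)
            * (1 + d1 * x / d2) ** (-(d1 + d2) / 2) / beta)

def f_upper_tail(f, d1, d2, steps=4000):
    """Area above f under the F curve: 1 minus a Simpson's-rule CDF on [0, f].

    Assumes d1 > 2 so the density is finite at zero.
    """
    h = f / steps
    total = f_density(0.0, d1, d2) + f_density(f, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_density(i * h, d1, d2)
    return 1 - total * h / 3

print(round(f_upper_tail(2.774, 20, 10), 4))  # close to .05
```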

Critical F for a Given Probability

If you know the critical alpha level and the degrees of freedom associated with the numerator and denominator, you can use this program to calculate the critical F-ratio.

Example

Two-tailed probability = .0500

Numerator degrees of freedom:  20

Denominator degrees of freedom:  10

-----------------------------------------------

F-ratio:  2.774

Chi-square distribution

The chi-square statistic is used to compare the observed frequencies in a table to the expected frequencies. This menu selection may be used to calculate the probability of a chi-square statistic or to determine the critical value of chi-square for a given probability. This menu selection is the computer equivalent of a chi-square table.

Probability of a Chi-Square Statistic

If you have a chi-square value and the degrees of freedom associated with the value, you can use this program to calculate the probability of the chi-square statistic. It is the equivalent of a computerized table of chi-square values.

Example

Chi-square value: 18.307

Degrees of freedom:  10

------------------------------------

Probability = .050
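The chi-square upper-tail probability can be computed from the regularized lower incomplete gamma function, which has a simple series expansion.  A standard-library sketch (an independent check, not StatPac's own code):

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-square with df degrees of freedom > x), via the series
    expansion of the regularized lower incomplete gamma function."""
    a = df / 2.0
    s = x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= s / (a + n)
        total += term
    lower = total * math.exp(-s + a * math.log(s) - math.lgamma(a))
    return 1.0 - lower

print(round(chi2_upper_tail(18.307, 10), 3))  # 0.05, matching the example above
```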

Critical Chi-Square for a Given Probability

If you have the critical alpha level and the degrees of freedom, you can use this program to calculate the critical chi-square value. It is the equivalent of a computerized table of chi-square values.

Example

Probability = .0500

Degrees of freedom:  10

------------------------------------

Chi-square value: 18.307

 

Counts Menu

The Counts menu selection has four tests that can be performed for simple frequency data. The chi-square test is used to analyze a contingency table consisting of rows and columns to determine if the observed cell frequencies differ significantly from the expected frequencies. Fisher's exact test is similar to the chi-square test except it is used only for tables with exactly two rows and two columns. The binomial test is used to calculate the probability of two mutually exclusive outcomes. The Poisson distribution events test is used to describe the number of events that will occur in a specific period of time.

Chi-square test

The chi-square is one of the most popular statistics because it is easy to calculate and interpret. There are two kinds of chi-square tests. The first is called a one-way analysis, and the second is called a two-way analysis. The purpose of both is to determine whether the observed frequencies (counts) markedly differ from the frequencies that we would expect by chance. 

The observed cell frequencies are organized in rows and columns like a spreadsheet.  This table of observed cell frequencies is called a contingency table, and the chi-square test is part of a contingency table analysis.

The chi-square statistic is the sum of the contributions from each of the individual cells. Every cell in a table contributes something to the overall chi-square statistic. If a given cell differs markedly from the expected frequency, then the contribution of that cell to the overall chi-square is large. If a cell is close to the expected frequency for that cell, then the contribution of that cell to the overall chi-square is low. A large chi-square statistic indicates that somewhere in the table, the observed frequencies differ markedly from the expected frequencies. It does not tell which cell (or cells) is causing the high chi-square...only that they are there. When a chi-square is high, you must visually examine the table to determine which cell(s) are responsible.  When there are exactly two rows and two columns, the chi-square statistic becomes inaccurate, and Yates' correction for continuity is often applied.

If there is only one column or one row (a one-way chi-square test), the degrees of freedom is the number of cells minus one.  For a two-way chi-square, the degrees of freedom is the number of rows minus one, times the number of columns minus one.

Using the chi-square statistic and its associated degrees of freedom, the software reports the probability that the differences between the observed and expected frequencies occurred by chance.  Generally, a probability of .05 or less is considered to be a significant difference. 

A standard spreadsheet interface is used to enter the counts for each cell.  After you've finished entering the data, the program will print the chi-square, degrees of freedom and probability of chance.

Use caution when interpreting the chi-square statistic if any of the cell frequencies are less than five.  Also, use caution when the total for all cells is less than 50.

Example

A drug manufacturing company conducted a survey of customers. The research question is: Is there a significant relationship between packaging preference (size of the bottle purchased) and economic status?  There were four packaging sizes: small, medium, large, and jumbo. Economic status was: lower, middle, and upper. The following data was collected.

             lower   middle   upper
small           24       22      18
medium          23       28      19
large           18       27      29
jumbo           16       21      33

------------------------------------------------

Chi-square statistic = 9.743

Degrees of freedom = 6

Probability of chance = .1359
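The chi-square statistic for this table can be reproduced with a few lines of Python; each expected frequency is the row total times the column total divided by the grand total (a sketch of the standard calculation, not StatPac's internal code):

```python
# Packaging-preference example from above (4 rows of sizes x 3 economic-status columns)
table = [
    [24, 22, 18],   # small:  lower, middle, upper
    [23, 28, 19],   # medium
    [18, 27, 29],   # large
    [16, 21, 33],   # jumbo
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi_square += (observed - expected) ** 2 / expected

df = (len(table) - 1) * (len(table[0]) - 1)
print(round(chi_square, 3), df)  # ~9.743 with 6 degrees of freedom
```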

Fisher's Exact Test

The chi-square statistic becomes inaccurate when used to analyze contingency tables that contain exactly two rows and two columns, and that contain fewer than 50 cases.  Fisher's exact probability is not plagued by inaccuracies due to small N's.  Therefore, it should be used for two-by-two contingency tables that contain fewer than 50 cases.

Example

Here are the results of a recent public opinion poll broken down by gender. What is the exact probability that the difference between the observed and expected frequencies occurred by chance?

                                Male       Female

Favor                      30            42

Opposed                70            58

-------------------------------------------

Fisher's exact probability = .0249

Binomial Test

The binomial distribution is used for calculating the probability of dichotomous outcomes in which the two choices are mutually exclusive. The program requires that you enter the number of trials, probability of the desired outcome on each trial, and the number of times the desired outcome was observed.

 

Example

If we were to flip a coin one hundred times and it came up heads seventy times, what is the probability of getting seventy or more heads?

 

Number of trials:  100

Probability of success on each trial (0-1):  .5

Number of successes:  70

---------------------------------------------------------

Probability of 70 or more successes < .0001
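The exact binomial tail probability is a direct sum over the binomial formula; Python's math.comb makes it a short sketch (an independent check on the output above):

```python
import math

def binom_upper_tail(n, p, k):
    """Exact probability of k or more successes in n trials (success prob p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

prob = binom_upper_tail(100, 0.5, 70)
print(prob)  # well below .0001
```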

Poisson Distribution Events Test

The Poisson distribution, like the binomial distribution, is used to determine the probability of an observed frequency.  It is used to describe the number of events that will occur in a specific period of time or in a specific area or volume.  You need to enter the observed and expected frequencies. 

Example

Previous research on a particular assembly line has shown an average daily defect rate of 39 products.  Thus, the expected number of defective products on any given day is 39.  The day after implementing a new quality control program, only 25 defects were found.  What is the probability of seeing 25 or fewer defects on any day?

 

Observed frequency:  25

Expected frequency:  39

---------------------------------------------------

Probability of 25 or fewer events = .0226
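The exact cumulative Poisson probability is also a direct sum.  Note that the one-tailed probability of 25 or fewer events comes out to about .0113; the .0226 shown above is twice that, which suggests the calculator reports a two-tailed value.  A standard-library sketch:

```python
import math

def poisson_lower_tail(observed, expected):
    """Exact probability of `observed` or fewer events when `expected` are expected."""
    return sum(math.exp(-expected) * expected**k / math.factorial(k)
               for k in range(observed + 1))

prob = poisson_lower_tail(25, 39)
print(round(prob, 4))      # one-tailed, about .0113
print(round(2 * prob, 4))  # doubled, about .0226 (the value shown above)
```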

 

Percents Menu

Percents are understood by nearly everyone, and therefore, they are the most popular statistics cited in research.  Researchers are often interested in comparing two percentages to determine whether there is a significant difference between them. 

Choosing the Proper Test

There are two kinds of t-tests between percents.  Which test you use depends upon whether you're comparing percentages from one or two samples.

Every percentage can be expressed as a fraction.  By looking at the denominator of the fraction we can determine whether to use a one-sample or two-sample t-test between percents.  If the denominators used to calculate the two percentages represent the same people, we use a one-sample t-test between percents to compare the two percents.  If the denominators represent different people, we use the two-sample t-test between percents.

For example, suppose you did a survey of 200 people.  Your survey asked,

                Were you satisfied with the program?

                                 ___ Yes    ___ No   ___ Don't know

Of the 200 people, 80 said  yes, 100 said no, and 20 didn't know.  You could summarize the responses as:

                Yes                         80/200 = .4 = 40%

                No                           100/200 = .5 = 50%

                Don't know            20/200 = .1 = 10%

Is there a significant difference between the percent saying yes (40%) and the percent saying no (50%)?  Obviously, there is a difference; but how sure are we that the difference didn't just happen by chance?  In other words, how reliable is the difference? 

Notice that the denominator used to calculate the percent of yes responses (200) represents the same people as the denominator used to calculate the percent of no responses (200).  Therefore, we use a one-sample t-test between proportions.  The key is that the denominators represent the same people (not that they are the same number).

After you completed your survey, another group of researchers tried to replicate your study.  They also used a sample size of 200, and asked the identical question.   Of the 200 people in their survey, 60 said  yes, 100 said no, and 40 didn't know.  They summarized their results as:

                Yes                        60/200 = .3 = 30%

                No                           100/200 = .5 = 50%

                Don't know            40/200 = .2 = 20%

Is there a significant difference between the percent who said yes in your survey (40%) and the percent that said yes in their survey (30%)?  For your survey the percent that said yes was calculated as 80/200, and in their survey it was 60/200.  To compare the yes responses between the two surveys, we would use a two-sample t-test between percents.  Even though both denominators were 200, they do not represent the same 200 people.

 

Examples that would use a one-sample t-test

 

Which proposal would you vote for?

     ___ Proposal A     ___ Proposal B

 

Which product do you like better?

     ___ Name Brand     ___ Brand X

 

Which candidate would you vote for?

     ___ Johnson     ___ Smith     ___ Anderson

When there are more than two choices, you can do the t-test between any two of them.  In this example, there are three possible combinations:  Johnson/Smith, Johnson/Anderson, and  Smith/Anderson.  Thus, you could actually perform three separate t-tests...one for each pair of candidates.  If this were your analysis plan, you would also use Bonferroni's theorem to adjust the critical alpha level because the plan involved multiple tests of the same type and family.

 

Examples that would use a two-sample t-test

 

A previous study found that 39% of the public believed in gun control.  Your study found that 34% believed in gun control.  Are the beliefs of your sample different from those of the previous study?

 

The results of a magazine readership study showed that 17% of the women and 11% of the men recalled seeing your ad in the last issue.  Is there a significant difference between men and women?

 

In a brand awareness study, 25% of the respondents from the Western region had heard of your product.  However, only 18% of the respondents from the Eastern region had heard of your product.  Is there a significant difference in product awareness between the Eastern and Western regions?

One Sample t-Test between Percents

This test can be performed to determine whether respondents are more likely to prefer one alternative or another.

Example

The research question is:  Is there a significant difference between the percent of people who say they will vote for candidate A and the percent of people who say they will vote for candidate B?  The null hypothesis is:  There is no significant difference between the percent of people who say they will vote for candidate A or candidate B.  The results of the survey were:

Plan to vote for candidate A = 35.5%

Plan to vote for candidate B = 22.4%

Sample size =  107

The sum of the two percents does not have to be equal to 100 (there may be candidates C and D, and people that have no opinion).  Use a one-sample t-test because both percentages came from a single sample.

Use a two-tailed probability because the null hypothesis does not state the direction of the difference.  If the hypothesis is that one particular choice has a greater percentage, use a one-tailed test (divide the two-tailed probability by two).

 

Enter the first percent:  35.5

Enter the second percent:  22.4

Enter the sample size:  107

-----------------------------------------

t-value = 1.808

Degrees of freedom = 106

Two-tailed probability = .074

 

You might make a statement in a report like this:  A one-sample t-test between proportions was performed to determine whether there was a significant difference between the percent choosing candidate A and candidate B.  The t-statistic was not significant at the .05 critical alpha level, t(106)=1.808, p=.074.  Therefore, we fail to reject the null hypothesis and conclude that the difference was not significant.
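The arithmetic behind this output can be sketched in a few lines of Python.  The helper name is ours, and the standard-error formula is an assumption (the usual variance of the difference between two proportions drawn from the same sample) that happens to reproduce the printed output:

```python
import math

def one_sample_t_between_percents(p1, p2, n):
    """t-test between two percents from a single sample.

    p1 and p2 are percents (0-100); n is the sample size.
    Assumes Var(p1 - p2) = (p1 + p2 - (p1 - p2)**2) / n, the standard
    variance for the difference of two proportions from one sample.
    """
    a, b = p1 / 100.0, p2 / 100.0
    se = math.sqrt((a + b - (a - b) ** 2) / n)   # standard error of the difference
    t = (a - b) / se
    return t, n - 1                              # t-value and degrees of freedom

t, df = one_sample_t_between_percents(35.5, 22.4, 107)
```

Looking up the t-value with its degrees of freedom gives the two-tailed probability; halve it for a one-tailed test, as described above.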

Two Sample t-Test between Percents

This test can be used to compare percentages drawn from two independent samples. It can also be used to compare two subgroups from a single sample.

Example

After conducting a survey of customers, you want to compare the attributes of men and women.  Even though all respondents were part of the same survey, the men and women are treated as two samples.  The percent of men with a particular attribute is calculated using the total number of men as the denominator for the fraction.  And the percent of women with the attribute is calculated using the total number of women as the denominator.  Since the denominators for the two fractions represent different people, a two-sample t-test between percents is appropriate.

The research question is: Is there a significant difference between the proportion of men having the attribute and the proportion of women having the attribute? The null hypothesis is: There is no significant difference between the proportion of men having the attribute and the proportion of women having the attribute. The results of the survey were:

86 men were surveyed and 22 of them (25.6%) had the attribute.

49 women were surveyed and 19 of them (38.8%) had the attribute.

 

Enter the first percent:  25.6

Enter the sample size for the first percent:  86

Enter the second percent:  38.8

Enter the sample size for the second percent:  49

-------------------------------------------------------------

t-value = 1.603

Degrees of freedom = 133

Two-tailed probability = .111

 

You might make a statement in a report like this:  A two-sample t-test between proportions was performed to determine whether there was a significant difference between men and women with respect to the percent who had the attribute. The t-statistic was not significant at the .05 critical alpha level, t(133)=1.603, p=.111.  Therefore, we fail to reject the null hypothesis and conclude that the difference between men and women was not significant.
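A minimal sketch of this calculation, assuming a pooled-proportion standard error (an assumption on our part, chosen because it reproduces the printed t and degrees of freedom; the helper name is ours):

```python
import math

def two_sample_t_between_percents(p1, n1, p2, n2):
    """Two-sample t-test between percents using a pooled proportion.

    p1 and p2 are percents (0-100); n1 and n2 are the two sample sizes.
    """
    a, b = p1 / 100.0, p2 / 100.0
    pooled = (a * n1 + b * n2) / (n1 + n2)       # overall proportion across both samples
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    t = abs(a - b) / se
    return t, n1 + n2 - 2                        # t-value and degrees of freedom

t, df = two_sample_t_between_percents(25.6, 86, 38.8, 49)
```

The same helper reproduces the shopping-center example below: two_sample_t_between_percents(64.0, 89, 55.4, 92) gives t = 1.179 with 179 degrees of freedom.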

Another example

Suppose interviews were conducted at two different shopping centers.  This two sample t-test between percents could be used to determine if the responses from the two shopping centers were different.

The research question is: Is there a significant difference between shopping centers A and B with respect to the percent that say they would buy product X?  The null hypothesis is: There is no significant difference between shopping centers A and B with respect to the percent of people that say they would buy product X. A two-tailed probability will be used because the hypothesis does not state the direction of the difference. The results of the survey were:

89 people were interviewed at shopping center A and 57 of them (64.0%) said they would buy product X.

92 people were interviewed at shopping center B and 51 of them (55.4%) said they would buy product X.

 

Enter the first percent:  64.0

Enter the sample size for the first percent:  89

Enter the second percent:  55.4

Enter the sample size for the second percent:  92

-------------------------------------------------------------

t-value = 1.179

Degrees of freedom = 179

Two-tailed probability = .240

 

You might write a paragraph in a report like this: A two-sample t-test between proportions was performed to determine whether there was a significant difference between the two shopping centers with respect to the percent who said they would buy product X.  The t-statistic was not significant at the .05 critical alpha level, t(179)=1.179, p=.240.  Therefore, we fail to reject the null hypothesis and conclude that the difference in responses between the two shopping centers was not significant.

Confidence Intervals around a Percent

Confidence intervals are used to determine how much latitude there is in the range of a percent if we were to take repeated samples from the population. 

Example

In a study of 150 customers, you find that 60 percent have a college degree. Your best estimate of the percent who have a college degree in the population of customers is also 60 percent.  However, since it is just an estimate, we establish confidence intervals around the estimate as a way of showing how reliable the estimate is.

Confidence intervals can be established for any error rate you are willing to accept. If, for example, you choose the 95% confidence interval, you would expect that in five percent of the samples drawn from the population, the percent who had a college degree would fall outside of the interval.

What are the 95% confidence intervals around this percent?  In the following example, note that no value is entered for the population size.  When the population is very large compared to the sample size (as in most research), it is not necessary to enter a population size.  If, however, the sample represents more than ten percent of the population, the formulas incorporate a finite population correction adjustment.  Thus, you only need to enter the population size when the sample size exceeds ten percent of the population size.

 

Enter the percent:  60

Enter the sample size:  150

Enter the population size:  (left blank)

Enter the desired confidence interval (%):  95

----------------------------------------------------------

Standard error of the proportion = .040

Degrees of freedom = 149

95% confidence interval = 60.0% ± 7.9%

Confidence interval range = 52.1% to 67.9%

 

Therefore, our best estimate of the population proportion with 5% error is 60% ± 7.9%.  Stated differently, if we predict that the proportion in the population who have a college degree is between 52.1% and 67.9%, our prediction would be wrong for 5% of the samples that we draw from the population.
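The interval above can be reproduced from the standard error of a proportion plus a critical t-value read from a table (about 1.976 for 149 degrees of freedom at 95%; the helper name and the table value are ours):

```python
import math

def percent_confidence_interval(pct, n, t_crit):
    """Confidence interval around a percent, assuming a large population.

    pct is a percent (0-100), n the sample size, and t_crit the
    two-tailed critical t for n - 1 degrees of freedom from a t table.
    """
    p = pct / 100.0
    se = math.sqrt(p * (1 - p) / n)      # standard error of the proportion
    margin = t_crit * se * 100           # margin of error, in percent
    return se, pct - margin, pct + margin

# df = 149, 95% confidence: critical t is about 1.976 (from a t table)
se, lo, hi = percent_confidence_interval(60, 150, 1.976)
```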

 

Means Menu

Researchers usually use the results from a sample to make inferential statements about the population.  When the data is interval or ratio scaled, it is usually described in terms of central tendency and variability.  Means and standard deviations are reported in nearly all research.

Mean and Standard Deviation of a Sample

This menu selection will let you enter data for a variable and calculate the mean, unbiased standard deviation, standard error of the mean, and median.  Data is entered using a standard spreadsheet interface. Finite population correction is incorporated into the calculation of the standard error of the mean, so the population size should be specified whenever the sample size is greater than ten percent of the population size.

Example

A sample of ten was randomly chosen from a large population.  The ten scores were:

20   22   54   32   41   43   47   51   45   35

----------------------------------------------------

Mean = 39.0

Unbiased standard deviation = 11.6

Standard error of the mean = 3.7

Median = 42.0
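These statistics can be checked with Python's standard library (statistics.stdev computes the unbiased, n - 1 standard deviation):

```python
import math
import statistics

scores = [20, 22, 54, 32, 41, 43, 47, 51, 45, 35]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)            # unbiased (n - 1) standard deviation
sem = sd / math.sqrt(len(scores))        # standard error of the mean
median = statistics.median(scores)
```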

Matched Pairs t-Test between Means

The matched pairs t-test is used in situations where two measurements are taken for each respondent.  It is often used in experiments where there are before-treatment and after-treatment measurements.  The t-test is used to determine if there is a reliable difference between the mean of the before-treatment measurements and the mean of the after-treatment measurements.

 

Pretreatment                Posttreatment

Johnny -------------------- Johnny

Martha -------------------- Martha

Jenny ---------------------- Jenny

 

Sometimes, in very sophisticated (i.e., expensive) experiments, two groups of subjects are individually matched on one or more demographic characteristics.  One group is exposed to a treatment (experimental group) and the other is not (control group).

 

Experimental                      Control

Johnny -------------------------- Fred

Martha -------------------------- Sharon

Jenny ---------------------------- Linda

 

The t-test works with small or large N's because it automatically takes into account the number of cases in calculating the probability level.  The magnitude of the t-statistic depends on the number of cases (subjects).  The t-statistic, in conjunction with the degrees of freedom, is used to calculate the probability that the difference between the means happened by chance.  If the probability is less than the critical alpha level, then we say that a significant difference exists between the two means.

Example

An example of a matched-pairs t-test might look like this:

Pretest        Posttest

8                    31

13                  37

22                  45

25                  28

29                 50

31                 37

35                 49

38                 25

42                 36

52                 69

-----------------------------------------------------------

Var.1:  Mean = 29.5   Unbiased SD = 13.2

Var. 2:  Mean = 40.7   Unbiased SD = 13.0

t-statistic = 2.69

Degrees of freedom = 9

Two-tailed probability = .025

 

You might make a statement in a report like this:  The mean pretest score was 29.5 and the mean posttest score was 40.7. A matched-pairs t-test was performed to determine if the difference was significant. The t-statistic was significant at the .05 critical alpha level, t(9)=2.69, p=.025.  Therefore, we reject the null hypothesis and conclude that posttest scores were significantly higher than pretest scores.
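A sketch of the matched-pairs calculation: the t-statistic is the mean of the paired differences divided by the standard error of those differences.

```python
import math
import statistics

pre  = [8, 13, 22, 25, 29, 31, 35, 38, 42, 52]
post = [31, 37, 45, 28, 50, 37, 49, 25, 36, 69]

diffs = [b - a for a, b in zip(pre, post)]   # posttest minus pretest, per subject
n = len(diffs)
se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the differences
t = statistics.mean(diffs) / se
df = n - 1
```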

Independent Groups t-Test between Means

This menu selection is used to determine if there is a difference between two means taken from different samples.  If you know the mean, standard deviation and size of both samples, this program may be used to determine if there is a reliable difference between the means.

One measurement is taken for each respondent.  Two groups are formed by splitting the data based on some other variable. The groups may contain a different number of cases.  There is not a one-to-one correspondence between the groups.

 

Score                      Sex                                                          Males                     Females

25                            M                                                            25                            27

27                            F              -----becomes----> 19                            17

17                            F                                                                                              21

19                            M

21                            F

 

Sometimes the two groups are formed because the data was collected from two different sources.

 

School A Scores                  School B Scores

525                                                          427

492                                                          535

582                                                          600

554

520

 

There are actually two different formulas to calculate the t-statistic for independent groups.  The t-statistics calculated by both formulas will be similar but not identical.  Which formula you choose depends on whether the variances of the two groups are equal or unequal.  In actual practice, most researchers assume that the variances are unequal because it is the most conservative approach and is least likely to produce a Type I error.  Thus, the formula used in Statistics Calculator assumes unequal variances.

Example

Two new product formulas were developed and tested.  A twenty-point scale was used to measure the level of product approval.  Six subjects tested the first formula. They gave it a mean rating of 12.3 with a standard deviation of 1.4. Nine subjects tested the second formula, and they gave it a mean rating of 14.0 with a standard deviation of 1.7.  The question we might ask is whether the observed difference between the two formulas is reliable.

 

Mean of the first group:  12.3

Unbiased standard deviation of the first group:  1.4

Sample size of the first group:  6

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Mean of the second group:  14.0

Unbiased standard deviation of the second group:  1.7

Sample size of the second group:  9

-------------------------------------------------------------------

t value =  2.03

Degrees of freedom = 13

Two-tailed probability = .064

 

You might make a statement in a report like this:  An independent groups t-test was performed to compare the mean ratings between the two formulas. The t-statistic was not significant at the .05 critical alpha level, t(13)=2.03, p=.064.  Therefore, we fail to reject the null hypothesis and conclude that there was no significant difference between the ratings for the two formulas.
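The printed t and degrees of freedom can be reproduced with the pooled (equal-variance) formula, sketched below.  Note that this is our assumption: the unequal-variance formula described above gives a slightly different t for the same inputs, while the pooled formula matches the output shown here.  The helper name is ours.

```python
import math

def independent_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent groups t-test using the pooled (equal-variance) formula."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))   # SE of the difference
    t = abs(mean1 - mean2) / se
    return t, n1 + n2 - 2                            # t-value and degrees of freedom

t, df = independent_t(12.3, 1.4, 6, 14.0, 1.7, 9)
```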

Confidence Interval around a Mean

You can calculate confidence intervals around a mean if you know the sample size and standard deviation.

The standard error of the mean is estimated from the standard deviation and the sample size. It is used to establish the confidence interval (the range within which we would expect the mean to fall in repeated samples taken from the population). The standard error of the mean is an estimate of the standard deviation of those repeated samples.

The formula for the standard error of the mean provides an accurate estimate when the sample size is very small compared to the size of the population.  In marketing research, this is usually the case since the populations are quite large. Thus, in most situations the population size may be left blank because the population is very large compared to the sample.  However, when the sample is more than ten percent of the population, the population size should be specified so that the finite population correction factor can be used to adjust the estimate of the standard error of the mean. 

Example

Suppose that an organization has 5,000 members.  Prior to their membership renewal drive, 75 members were randomly selected and surveyed to find out their priorities for the coming year.  The mean average age of the sample was 53.1 and the unbiased standard deviation was 4.2 years.  What is the 90% confidence interval around the mean?  Note that the population size can be left blank because the sample size of 75 is less than ten percent of the population size.

 

Mean:  53.1

Unbiased standard deviation:  4.2

Sample size:  75

Population size:  (left blank -or- 5000)

Desired confidence interval (%):   90

-------------------------------------------------

Standard error of the mean = .485

Degrees of freedom = 74

90% confidence interval = 53.1 ± .8

Confidence interval range = 52.3 - 53.9
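A sketch of the interval calculation, using a critical t of about 1.666 for 74 degrees of freedom at 90% (read from a t table; the helper name and the table value are ours):

```python
import math

def mean_confidence_interval(mean, sd, n, t_crit):
    """Confidence interval around a mean, assuming a large population.

    t_crit is the two-tailed critical t for n - 1 degrees of freedom.
    """
    sem = sd / math.sqrt(n)       # standard error of the mean
    margin = t_crit * sem
    return sem, mean - margin, mean + margin

# df = 74, 90% confidence: critical t is about 1.666 (from a t table)
sem, lo, hi = mean_confidence_interval(53.1, 4.2, 75, 1.666)
```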

Compare a Sample Mean to a Population Mean

 

Occasionally, the mean of the population is known (perhaps from a previous census).  After drawing a sample from the population, it might be helpful to compare the mean of your sample to the mean of the population.  If the means are not significantly different from each other, you could make a strong argument that your sample provides an adequate representation of the population.  If, however, the mean of your sample is significantly different than the population, something may have gone wrong during the sampling process.

Example

After selecting a random sample of 18 people from a very large population, you want to determine if the average age of the sample is representative of the average age of the population. From previous research, you know that the mean age of the population is 32.0.  For your sample, the mean age was 28.0 and the unbiased standard deviation was 3.2.  Is the mean age of your sample significantly different from the mean age in the population?

 

Sample mean = 28

Unbiased standard deviation = 3.2

Sample size = 18

Population size = (left blank)

Mean of the population = 32

---------------------------------------

Standard error of the mean = .754

t value = 5.303

Degrees of freedom = 17

Two-tailed probability = .0001

 

The two-tailed probability of the t-statistic is very small. Thus, we would conclude that the mean age of our sample is significantly less than the mean age of the population. This could be a serious problem because it suggests that some kind of age bias was inadvertently introduced into the sampling process.  It would be prudent for the researcher to investigate the problem further.
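The calculation reduces to dividing the difference between the sample mean and the population mean by the standard error of the mean (the helper name is ours):

```python
import math

def one_sample_t(sample_mean, sd, n, population_mean):
    """Compare a sample mean to a known population mean."""
    sem = sd / math.sqrt(n)                       # standard error of the mean
    t = (sample_mean - population_mean) / sem     # negative: sample mean is lower
    return sem, t, n - 1

sem, t, df = one_sample_t(28, 3.2, 18, 32)
```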

Compare Two Standard Deviations

The F-ratio is used to compare variances.  In its simplest form, it is the variance of one group divided by the variance of another group.  When used in this way, the larger variance (by convention) is the numerator and the smaller is the denominator. Since the groups might have different sample sizes, the numerator and the denominator have their own degrees of freedom.

Example

Two samples were taken from the population.  One sample had 25 subjects and a standard deviation of 4.5 on some key variable.  The other sample had 12 subjects and had a standard deviation of 6.4 on the same key variable.  Is there a significant difference between the variances of the two samples?

 

First standard deviation:  4.5

First sample size:  25

Second standard deviation:  6.4

Second sample size:  12

-----------------------------------------

F-ratio = 2.023

Degrees of freedom = 11 and 24

Probability that the difference was due to chance = .072
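The F-ratio calculation is short enough to sketch directly, with the larger variance placed in the numerator as described above (the helper name is ours):

```python
def f_ratio(sd1, n1, sd2, n2):
    """F-ratio for comparing two standard deviations.

    The larger variance goes in the numerator by convention, and the
    degrees of freedom follow the numerator/denominator ordering.
    """
    v1, v2 = sd1 ** 2, sd2 ** 2
    if v1 >= v2:
        return v1 / v2, n1 - 1, n2 - 1
    return v2 / v1, n2 - 1, n1 - 1

f, df_num, df_den = f_ratio(4.5, 25, 6.4, 12)
```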

Compare Three or more Means

Analysis of variance (ANOVA) is used when testing for differences between three or more means.

In an ANOVA, the F-ratio is used to compare the variance between the groups to the variance within the groups.  For example, suppose we have two groups of data.  In the best of all possible worlds, all the people in group one would have very similar scores.  That is, the group is cohesive, and there would be very little variability in scores within the group.  All the people in group two would also have similar scores (although different than group one).  Again, there is very little variability within the group.  Both groups have very little variability within their group; however, there might be substantial variability between the groups.  The ratio of the between-groups variability (numerator) to the within-groups variability (denominator) is the F-ratio.  The larger the F-ratio, the more certain we are that there is a difference between the groups.

If the probability of the F-ratio is less than or equal to your critical alpha level, it means that there is a significant difference between at least two of the groups.  The F-ratio does not tell which group(s) are different from the others...just that there is a difference.

After finding a significant F-ratio, we do "post-hoc" (after the fact) tests on the factor to examine the differences between levels.  There are a wide variety of post-hoc tests, but one of the most common is to do a series of special t-tests between all the combinations of levels for that factor.  For the post-hoc tests, use the same critical alpha level that you used to test for the significance of the F-ratio.

Example

A company has offices in four cities with sales representatives in each office.  At each location, the average number of sales per salesperson was calculated.  The company wants to know if there are significant differences between the four offices with respect to the average number of sales per sales representative.

 

Group     Mean      SD           N

1              3.29         1.38         7

2              4.90         1.45         10

3              7.50         1.38         6

4              6.00         1.60         8

-----------------------------------------------------------------------------------

Source    df            SS           MS          F              p

-----------------------------------------------------------------------------------

Factor       3            62.8         20.9         9.78         .0002

Error       27            57.8         2.13

Total       30            120.6

 

Post-hoc t-tests

Group     Group     t-value    df            p

1              2              2.23         15            .0412

1              3              5.17         11            .0003

1              4              3.58         13            .0034

2              3              3.44         14            .0040

2              4              1.59         16            .1325

3              4              1.09         12            .297
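The ANOVA table above can be reconstructed from just the group means, standard deviations, and sizes.  A sketch (the helper name is ours):

```python
def anova_from_summary(groups):
    """One-way ANOVA computed from (mean, sd, n) summaries of each group."""
    total_n = sum(n for _, _, n in groups)
    grand_mean = sum(m * n for m, _, n in groups) / total_n
    # Between-groups sum of squares: weighted squared deviations of group means
    ss_between = sum(n * (m - grand_mean) ** 2 for m, _, n in groups)
    # Within-groups sum of squares: recovered from each group's unbiased SD
    ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f

groups = [(3.29, 1.38, 7), (4.90, 1.45, 10), (7.50, 1.38, 6), (6.00, 1.60, 8)]
ssb, ssw, dfb, dfw, f = anova_from_summary(groups)
```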

 

Correlation Menu

Correlation is a measure of association between two variables. The variables are not designated as dependent or independent. The two most popular correlation coefficients are:  Spearman's correlation coefficient rho and Pearson's product-moment correlation coefficient.

When calculating a correlation coefficient for ordinal data, select Spearman's technique. For interval or ratio-type data, use Pearson's technique.

The value of a correlation coefficient can vary from minus one to plus one. A minus one indicates a perfect negative correlation, while a plus one indicates a perfect positive correlation. A correlation of zero means there is no relationship between the two variables. When there is a negative correlation between two variables, as the value of one variable increases, the value of the other variable decreases, and vice versa. In other words, for a negative correlation, the variables work opposite each other. When there is a positive correlation between two variables, as the value of one variable increases, the value of the other variable also increases. The variables move together.

The standard error of a correlation coefficient is used to determine the confidence intervals around a true correlation of zero.  If your correlation coefficient falls outside of this range, then it is significantly different than zero.  The standard error can be calculated for interval or ratio-type data (i.e., only for Pearson's product-moment correlation). 

The significance (probability) of the correlation coefficient is determined from the t-statistic. The probability of the t-statistic indicates whether the observed correlation coefficient occurred by chance if the true correlation is zero.  In other words, it asks if the correlation is significantly different than zero. When the t-statistic is calculated for Spearman's rank-difference correlation coefficient, there must be at least 30 cases before the t-distribution can be used to determine the probability.  If there are fewer than 30 cases, you must refer to a special table to find the probability of the correlation coefficient.

Example

A company wanted to know if there is a significant relationship between the total number of salespeople and the total number of sales.  They collected data for five months.

 

Var. 1      Var. 2

207          6907

180          5991

220          6810

205          6553

190          6190

--------------------------------

Correlation coefficient = .921

Standard error of the coefficient = .068

t-test for the significance of the coefficient = 4.100

Degrees of freedom = 3

Two-tailed probability = .0263
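A sketch of the Pearson calculation and the t-test for its significance (the helper name is ours):

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation and its significance t-statistic."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # t-test for r != 0
    return r, t, n - 2

r, t, df = pearson_r([207, 180, 220, 205, 190],
                     [6907, 5991, 6810, 6553, 6190])
```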

 

Another Example

Respondents to a survey were asked to judge the quality of a product on a four-point Likert scale (excellent, good, fair, poor).  They were also asked to judge the reputation of the company that made the product on a three-point scale (good, fair, poor).  Is there a significant relationship between respondents' perceptions of the company and their perceptions of the quality of the product?

Since both variables are ordinal, Spearman's method is chosen.  The first variable is the rating of the quality of the product.  Responses are coded as 4=excellent, 3=good, 2=fair, and 1=poor.  The second variable is the perceived reputation of the company and is coded 3=good, 2=fair, and 1=poor.

 

Var. 1      Var. 2

4              3

2              2

1              2

3              3

4              3

1              1

2              1

-------------------------------------------

Correlation coefficient rho = .830

t-test for the significance of the coefficient = 3.332

Number of data pairs = 7

Probability must be determined from a table because of the small sample size.
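A sketch of the Spearman calculation, giving tied values average ranks.  The classic rho = 1 - 6*sum(d^2)/(n(n^2 - 1)) formula without a tie correction is assumed here because it reproduces the printed rho; a tie-corrected formula would give a slightly smaller value.

```python
import math

def spearman_rho(xs, ys):
    """Spearman rank correlation (classic formula, no tie correction)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        i = 0
        while i < len(values):
            j = i
            # Extend j over any run of tied values
            while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1            # average of the tied rank positions
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

xs = [4, 2, 1, 3, 4, 1, 2]                   # product quality ratings
ys = [3, 2, 2, 3, 3, 1, 1]                   # company reputation ratings
rho = spearman_rho(xs, ys)
t = rho * math.sqrt(len(xs) - 2) / math.sqrt(1 - rho ** 2)
```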

Regression

Simple regression is used to examine the relationship between one dependent and one independent variable.  After performing an analysis, the regression statistics can be used to predict the dependent variable when the independent variable is known.  Regression goes beyond correlation by adding prediction capabilities.

People use regression on an intuitive level every day.  In business, a well-dressed man is thought to be financially successful.  A mother knows that more sugar in her children's diet results in higher energy levels.  The ease of waking up in the morning often depends on how late you went to bed the night before.  Quantitative regression adds precision by developing a mathematical formula that can be used for predictive purposes.

For example, a medical researcher might want to use body weight (independent variable) to predict the most appropriate dose for a new drug (dependent variable).  The purpose of running the regression is to find a formula that fits the relationship between the two variables.  Then you can use that formula to predict values for the dependent variable when only the independent variable is known.  A doctor could prescribe the proper dose based on a person's body weight.   

The regression line (known as the least squares line) is a plot of the expected value of the dependent variable for all values of the independent variable.  Technically, it is the line that "minimizes the squared residuals". The regression line is the one that best fits the data on a scatterplot. 

Using the regression equation, the dependent variable may be predicted from the independent variable.  The slope of the regression line (b) is defined as the rise divided by the run.  The y intercept (a) is the point on the y axis where the regression line would intercept the y axis.  The slope and y intercept are incorporated into the regression equation. The intercept is usually called the constant, and the slope is referred to as the coefficient.  Since the regression model is usually not a perfect predictor, there is also an error term in the equation.

In the regression equation, y is always the dependent variable and x is always the independent variable. Here are three equivalent ways to mathematically describe a linear regression model.

y = intercept + (slope × x) + error

y = constant + (coefficient × x) + error

y = a + bx + e

The significance of the slope of the regression line is determined from the t-statistic.  It is the probability that the observed slope occurred by chance if the true slope is zero.  Some researchers prefer to report the F-ratio instead of the t-statistic.  The F-ratio is equal to the t-statistic squared.

The t-statistic for the significance of the slope is essentially a test to determine if the regression model (equation) is usable.  If the slope is significantly different than zero, then we can use the regression model to predict the dependent variable for any value of the independent variable.

On the other hand, take an example where the slope is zero.  It has no prediction ability because for every value of the independent variable, the prediction for the dependent variable would be the same. Knowing the value of the independent variable would not improve our ability to predict the dependent variable. Thus, if the slope is not significantly different than zero, don't use the model to make predictions.

The coefficient of determination (r-squared) is the square of the correlation coefficient.  Its value may vary from zero to one.  It has the advantage over the correlation coefficient in that it may be interpreted directly as the proportion of variance in the dependent variable that can be accounted for by the regression equation.  For example, an r-squared value of .49 means that 49% of the variance in the dependent variable can be explained by the regression equation.  The other 51% is unexplained.

The standard error of the estimate for regression measures the amount of variability in the points around the regression line.  It is the standard deviation of the data points as they are distributed around the regression line.  The standard error of the estimate can be used to develop confidence intervals around a prediction.
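The pieces described above (slope, intercept, and r-squared) can be sketched in a few lines.  The helper name and the data are made up for illustration: the y values are nearly 1 + 2x, so the fitted constant and coefficient come out close to 1 and 2 with an r-squared near one.

```python
import math

def simple_regression(xs, ys):
    """Least-squares fit: returns intercept (a), slope (b), and r-squared."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    b = sxy / sxx                        # slope (coefficient)
    a = my - b * mx                      # y intercept (constant)
    r_squared = sxy ** 2 / (sxx * syy)   # proportion of variance explained
    return a, b, r_squared

# Hypothetical data: y is approximately 1 + 2x
a, b, r2 = simple_regression([1, 2, 3, 4], [3.1, 4.9, 7.0, 9.0])
```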

Example

A company wants to know if there is a significant relationship between its advertising expenditures and its sales volume.  The independent variable is advertising budget and the dependent variable is sales volume.  A lag time of one month will be used because sales are expected to lag behind actual advertising expenditures.  Data was collected for a six month period. All figures are in thousands of dollars. Is there a significant relationship between advertising budget and sales volume?

 

IV            DV

4.2           27.1

6.1           30.4

3.9           25.0

5.7           29.7

7.3           40.1

5.9           28.8

--------------------------------------------------

Model:  y = 10.079 + (3.700x) + error

Standard error of the estimate = 2.568

t-test for the significance of the slope = 4.095

Degrees of freedom = 4

Two-tailed probability = .0149

r-squared = .807

 

You might make a statement in a report like this:  A simple linear regression was performed on six months of data to determine if there was a significant relationship between advertising expenditures and sales volume. The t-statistic for the slope was significant at the .05 critical alpha level, t(4)=4.10, p=.015.  Thus, we reject the null hypothesis and conclude that there was a positive significant relationship between advertising expenditures and sales volume. Furthermore, 80.7% of the variability in sales volume could be explained by advertising expenditures.

 

Sampling Menu

The formula to determine sample size depends upon whether the intended comparisons involve means or percents.

Sample Size for Percents

 

This menu selection is used to determine the required size of a sample for research questions involving percents.

Four questions must be answered to determine the sample size:

1. Best estimate of the population size: You do not need to know the exact size of the population. Simply make your best estimate. An inaccurate population size will not seriously affect the formula computations.  If the population is very large, this item may be left blank.

2. Best estimate of the rate in the population (%):  Make your best estimate of what the actual percent of the survey characteristic is. This is based on the null hypothesis.  For example, if the null hypothesis is "blondes don't have more fun", then what is your best estimate of the percent of blondes that do have more fun?  If you simply do not know, then enter 50 (for fifty percent).

3. Maximum acceptable difference (%):  This is the maximum percent difference that you are willing to accept between the true population rate and the sample rate. Typically, in social science research, you would be willing to accept a difference of 5 percent. That is, if your survey finds that 25 percent of the sample has a certain characteristic, the actual rate in the population may be between 20 and 30 percent.

4. Desired confidence level (%):  How confident must you be that the true population rate falls within the acceptable difference (specified in the previous question)?  This is the same as the confidence that you want to have in your findings.  If you want 95 percent confidence (typical for social science research), you should enter 95.  This means that if you took a hundred samples from the population, five of those samples would have a rate that exceeded the difference you specified in the previous question. 

Example

A publishing company wants to know what percent of the population might be interested in a new magazine on making the most of your retirement.  Secondary data (several years old) indicates that 22% of the population is retired.  They want to be 95% certain that their finding does not differ from the true rate by more than 5%.  What is the required sample size?

 

Best estimate of the population size:  (left blank)

Best estimate of the rate in the population (%):  22

Maximum acceptable difference (%): 5

Desired confidence level (%):  95

-------------------------------------------------------------

Required sample size = 263
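The computation above follows the standard normal-approximation formula for a proportion, with a finite population correction when a population size is supplied. The sketch below is my own illustration of that formula, not StatPac's exact implementation; the z-values and rounding rule are assumptions:

```python
def sample_size_percent(rate_pct, max_diff_pct, conf_pct, population=None):
    """Required sample size for estimating a percent (normal approximation)."""
    # Two-tailed z-score for common confidence levels (assumed values).
    z = {90: 1.645, 95: 1.96, 99: 2.576}[conf_pct]
    p = rate_pct / 100.0          # estimated population rate
    e = max_diff_pct / 100.0      # maximum acceptable difference
    n = z**2 * p * (1 - p) / e**2 # infinite-population sample size
    if population:                # finite population correction
        n = n / (1 + (n - 1) / population)
    return n

# Worked example from the text: rate 22%, difference 5%, confidence 95%.
n = sample_size_percent(22, 5, 95)
```

With these inputs the formula yields roughly 263.7, consistent with the 263 reported above (the exact rounding StatPac applies may differ).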

Sample Size for Means

This menu selection is used to determine the required size of a sample for research questions involving means.

Three questions must be answered to determine the sample size:

1. Standard deviation of the population:  It is rare that a researcher knows the exact standard deviation of the population. Typically, the standard deviation of the population is estimated a) from the results of a previous survey, b) from a pilot study, c) from secondary data, or d) from the judgment of the researcher.

2. Maximum acceptable difference:  This is the maximum amount of error that you are willing to accept. That is, it is the maximum difference that the sample mean can deviate from the true population mean before you call the difference significant.

3. Desired confidence level (%):  The confidence level is your level of certainty that the sample mean does not differ from the true population mean by more than the maximum acceptable difference. Typically, social science research uses a 95% confidence level.

Example

A fast food company wants to determine the average number of times that fast food users visit fast food restaurants per week.  They have decided that their estimate needs to be accurate within plus or minus one-tenth of a visit, and they want to be 95% sure that their estimate does not differ from the true number of visits by more than one-tenth of a visit.  Previous research has shown that the standard deviation is .7 visits.  What is the required sample size?

 

Population standard deviation:  .7

Maximum acceptable difference:  .1

Desired confidence level (%):  95

--------------------------------------------

Required sample size = 188
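For means, the standard formula squares the ratio of the z-score times the standard deviation to the maximum acceptable difference. The sketch below is my own illustration of that textbook formula, not StatPac's exact code; the z-values are assumed:

```python
def sample_size_mean(std_dev, max_diff, conf_pct=95):
    """Required sample size for estimating a mean: n = (z * sigma / e)^2."""
    # Two-tailed z-score for common confidence levels (assumed values).
    z = {90: 1.645, 95: 1.96, 99: 2.576}[conf_pct]
    return (z * std_dev / max_diff) ** 2

# Worked example from the text: sigma = .7 visits, difference = .1, 95% confidence.
n = sample_size_mean(0.7, 0.1, 95)
```

With these inputs the formula yields roughly 188.2, consistent with the 188 reported above.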