1.1 Do first babies arrive late?
Anecdotal Evidence is based on data that is unpublished and usually personal. For example,
1.2 A Statistical Approach
The limitations of anecdotal evidence can be addressed with the tools of statistics, which include data collection, descriptive statistics, exploratory data analysis, hypothesis testing, and estimation.
1.3 The National Survey of Family Growth (NSFG)
The NSFG is a cross-sectional study: it captures a snapshot of a group at a point in time. The alternative is a longitudinal study, which observes a group repeatedly over a period of time. The people who participate in a survey are called respondents. Cross-sectional studies are generally meant to be representative, which means that every member of the target population has an equal chance of participating. The NSFG is not fully representative: it deliberately oversamples certain groups, meaning they are sampled at higher rates than their share of the US population. The drawback of oversampling is that it is harder to draw conclusions about the general population from the survey's statistics unless the sampling weights are taken into account.
Exercise 1.2 Download data from the NSFG:
import pandas as pd
# Reference to extract the columns: http://greenteapress.com/thinkstats/survey.py
# colspecs are 0-based, half-open (start, end) character positions within each line.
pregnancies = pd.read_fwf(
    "2002FemPreg.dat",
    names=["caseid", "nbrnaliv", "babysex", "birthwgt_lb",
           "birthwgt_oz", "prglength", "outcome", "birthord",
           "agepreg", "finalwgt"],
    colspecs=[(0, 12), (21, 22), (55, 56), (57, 58), (58, 60),
              (274, 276), (276, 277), (278, 279), (283, 285), (422, 439)],
)
pregnancies.head()
| | caseid | nbrnaliv | babysex | birthwgt_lb | birthwgt_oz | prglength | outcome | birthord | agepreg | finalwgt |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 | 8.0 | 13.0 | 39 | 1 | 1.0 | 33.0 | 6448.271112 |
| 1 | 1 | 1.0 | 2.0 | 7.0 | 14.0 | 39 | 1 | 2.0 | 39.0 | 6448.271112 |
| 2 | 2 | 3.0 | 1.0 | 9.0 | 2.0 | 39 | 1 | 1.0 | 14.0 | 12999.542264 |
| 3 | 2 | 1.0 | 2.0 | 7.0 | 0.0 | 39 | 1 | 2.0 | 17.0 | 12999.542264 |
| 4 | 2 | 1.0 | 2.0 | 6.0 | 3.0 | 39 | 1 | 3.0 | 18.0 | 12999.542264 |
The fields used in the analysis are described below:
caseid | prglength | outcome | birthord | finalwgt |
---|---|---|---|---|
Integer ID of the respondent | Integer duration of the pregnancy in weeks | 1 indicates a live birth | 1 indicates a first child | Number of people in the US population that this respondent represents |
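Because finalwgt states how many people each respondent represents, it is the natural way to correct for the oversampling described in Section 1.3. The sketch below uses a small made-up table (not NSFG values) and a hypothetical helper, weighted_mean, to show how an unweighted mean and a weighted mean can differ. It is an illustration under those assumptions, not the method used in the exercises that follow.

```python
import pandas as pd

# Toy data, invented for illustration only; these are not NSFG values.
toy = pd.DataFrame({
    "prglength": [39, 40, 38, 36],
    "finalwgt": [1000.0, 1000.0, 4000.0, 4000.0],
})


def weighted_mean(values, weights):
    """Hypothetical helper: return sum(w * x) / sum(w)."""
    return (values * weights).sum() / weights.sum()


# Every row counts equally.
unweighted = toy["prglength"].mean()

# Each row counts in proportion to the number of people it represents.
weighted = weighted_mean(toy["prglength"], toy["finalwgt"])

print("Unweighted mean:", unweighted)
print("Weighted mean:", weighted)
```

In this toy table the heavily weighted rows have shorter pregnancies, so the weighted mean is lower than the unweighted one.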
Exercise 1.3 Explore the data in the Pregnancies table. Count the number of live births and, among those live births, compute the average pregnancy length (in weeks) for first babies and for all others.
pregnancies.describe()
| | caseid | nbrnaliv | babysex | birthwgt_lb | birthwgt_oz | prglength | outcome | birthord | agepreg | finalwgt |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 13593.000000 | 9148.000000 | 9144.000000 | 9144.000000 | 9087.000000 | 13593.000000 | 13593.000000 | 9148.000000 | 13241.000000 | 13593.000000 |
| mean | 6216.526595 | 1.025907 | 1.494532 | 6.653762 | 7.403874 | 29.531229 | 1.763996 | 1.824552 | 24.230949 | 8196.422280 |
| std | 3645.417341 | 0.252864 | 0.515295 | 1.588809 | 8.097454 | 13.802523 | 1.315930 | 1.037053 | 5.824302 | 9325.918114 |
| min | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 10.000000 | 118.656790 |
| 25% | 3022.000000 | 1.000000 | 1.000000 | 6.000000 | 3.000000 | 13.000000 | 1.000000 | 1.000000 | 20.000000 | 3841.375308 |
| 50% | 6161.000000 | 1.000000 | 1.000000 | 7.000000 | 7.000000 | 39.000000 | 1.000000 | 2.000000 | 23.000000 | 6256.592133 |
| 75% | 9423.000000 | 1.000000 | 2.000000 | 8.000000 | 11.000000 | 39.000000 | 2.000000 | 2.000000 | 28.000000 | 9432.360931 |
| max | 12571.000000 | 9.000000 | 9.000000 | 9.000000 | 99.000000 | 50.000000 | 6.000000 | 9.000000 | 44.000000 | 261879.953864 |
live_births = pregnancies[pregnancies['outcome'] == 1]
print("Number of live births is: " + str(live_births.shape[0]))
mean_first = live_births[live_births['birthord'] == 1]['prglength'].mean()
mean_other = live_births[live_births['birthord'] != 1]['prglength'].mean()
print("Mean Pregnancy length for live births of first babies is: " + str(mean_first))
print("Mean Pregnancy length for live births of other babies is: " + str(mean_other))
print("Difference in Mean Pregnancy length for first and other babies is : " + str(mean_first - mean_other))
Number of live births is: 9148
Mean Pregnancy length for live births of first babies is: 38.6009517335
Mean Pregnancy length for live births of other babies is: 38.5229144667
Difference in Mean Pregnancy length for first and other babies is : 0.0780372667775
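The same first-versus-other comparison can also be written as a single groupby, which returns the count and mean for both groups at once. The sketch below runs on a small made-up table (the values are illustrative, not NSFG data) so that the result is easy to check by hand. The labels "first" and "other" are names chosen here for readability.

```python
import pandas as pd

# Toy data, invented for illustration only; these are not NSFG values.
toy = pd.DataFrame({
    "outcome":   [1, 1, 1, 2, 1, 4],
    "birthord":  [1, 2, 1, None, 3, None],
    "prglength": [39, 38, 41, 9, 40, 12],
})

# Keep live births only (outcome == 1).
live = toy[toy["outcome"] == 1]

# Label each live birth as "first" or "other" based on birth order.
group = (live["birthord"] == 1).map({True: "first", False: "other"})

# Count and mean pregnancy length per group.
summary = live.groupby(group)["prglength"].agg(["count", "mean"])
print(summary)

diff = summary.loc["first", "mean"] - summary.loc["other", "mean"]
print("Difference in means (weeks):", diff)
```

On the real data, replacing toy with the pregnancies DataFrame should give the same grouping as the explicit filters used above.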
1.5 Significance
The analysis above shows a difference of about 0.078 weeks, or roughly 13.11 hours, between the mean pregnancy lengths of first babies and others. A difference like this is called an apparent effect, which means that there might be something going on, but we are not yet sure. If the difference could plausibly have occurred by chance, we conclude that the effect is not statistically significant. An apparent effect that is caused by bias, measurement error, or some other kind of error is called an artifact.
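The figure in hours comes from converting the difference in weeks reported in Exercise 1.3. A minimal sketch of that arithmetic, with the constant name HOURS_PER_WEEK chosen here for illustration:

```python
# Difference in mean pregnancy length (weeks), as printed in Exercise 1.3.
diff_weeks = 0.0780372667775

# 7 days per week, 24 hours per day.
HOURS_PER_WEEK = 7 * 24

diff_hours = diff_weeks * HOURS_PER_WEEK
print("Difference in hours:", round(diff_hours, 2))
```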