Think Stats: Chapter 1

Posted by Amit Rajan on Sunday, August 5, 2018

1.1 Do first babies arrive late?

Anecdotal Evidence is based on data that is unpublished and usually personal. For example,

“My two friends that have given birth recently to their first babies, BOTH went almost 2 weeks overdue before going into labour or being induced.”

Anecdotal evidence usually fails for four reasons: a small number of observations; selection bias (people who join a discussion of this question might be interested because their first babies were late); confirmation bias (people who believe the claim might be more likely to contribute examples that confirm it); and inaccuracy.

1.2 A Statistical Approach

The limitations of anecdotal evidence can be addressed with the tools of statistics, which include data collection, descriptive statistics, exploratory data analysis, hypothesis testing and estimation.

1.3 The National Survey of Family Growth (NSFG)

The NSFG is a cross-sectional study: it captures a snapshot of a group at a point in time. The alternative is a longitudinal study, which observes a group repeatedly over a period of time. The people who participate in a survey are called respondents. Cross-sectional studies are generally meant to be representative, which means that every member of the target population has an equal chance of participating. The NSFG departs from this ideal: it is deliberately oversampled, meaning that certain groups are sampled at higher rates than their share of the US population. The drawback of oversampling is that it is harder to draw conclusions about the general population from the survey's raw statistics.
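
To make the effect of oversampling concrete, here is a minimal sketch using a made-up population with two groups. The group sizes, values and sampling rates are invented for illustration and are not NSFG figures. Each respondent carries a weight equal to the number of people they stand for, which is the role the finalwgt field plays in the NSFG data.

```python
# Hypothetical population: group A is 90% of people, group B is 10%.
# All numbers here are made up for illustration; they are not NSFG data.
population = {"A": 9000, "B": 1000}
value = {"A": 10.0, "B": 20.0}      # some quantity measured for each person

# True population mean, for reference.
true_mean = (sum(population[g] * value[g] for g in population)
             / sum(population.values()))

# Oversample group B: 100 respondents from each group, although B is
# only a tenth of the population.
sampled = {"A": 100, "B": 100}

# One (value, weight) pair per respondent. The weight is the number of
# people in the population that the respondent represents.
respondents = [(value[g], population[g] / sampled[g])
               for g in sampled
               for _ in range(sampled[g])]

unweighted_mean = sum(v for v, _ in respondents) / len(respondents)
weighted_mean = (sum(v * w for v, w in respondents)
                 / sum(w for _, w in respondents))

print("True population mean:  ", true_mean)
print("Unweighted sample mean:", unweighted_mean)
print("Weighted sample mean:  ", weighted_mean)
```

In this toy setup the unweighted mean is pulled toward the oversampled group, while the weighted mean recovers the population value. That is why statistics computed directly from an oversampled survey need care before they are generalized.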

Exercise 1.2 Download data from the NSFG:

import pandas as pd

# Reference for the column positions: http://greenteapress.com/thinkstats/survey.py
# Read the fixed-width file. Each colspecs entry is a half-open (start, end)
# pair of 0-based character positions for the matching entry in names.
pregnancies = pd.read_fwf("2002FemPreg.dat",
                         names=["caseid", "nbrnaliv", "babysex", "birthwgt_lb",
                               "birthwgt_oz", "prglength", "outcome", "birthord",
                               "agepreg", "finalwgt"],
                         colspecs=[(0, 12), (21, 22), (55, 56), (57, 58), (58, 60),
                                (274, 276), (276, 277), (278, 279), (283, 285), (422, 439)])
# Show the first five rows
pregnancies.head()

   caseid  nbrnaliv  babysex  birthwgt_lb  birthwgt_oz  prglength  outcome  birthord  agepreg      finalwgt
0       1       1.0      1.0          8.0         13.0         39        1       1.0     33.0   6448.271112
1       1       1.0      2.0          7.0         14.0         39        1       2.0     39.0   6448.271112
2       2       3.0      1.0          9.0          2.0         39        1       1.0     14.0  12999.542264
3       2       1.0      2.0          7.0          0.0         39        1       2.0     17.0  12999.542264
4       2       1.0      2.0          6.0          3.0         39        1       3.0     18.0  12999.542264

The fields are described as follows:

caseid: integer ID of the respondent
prglength: integer duration of the pregnancy in weeks
outcome: integer code for the outcome of the pregnancy; 1 indicates a live birth
birthord: birth-order code; the code for a first child is 1
finalwgt: number of people in the US population that this respondent represents

Exercise 1.3 Explore the data in the Pregnancies table. Count the number of live births and compute the average pregnancy length (in weeks) for first babies and others for the live births.

pregnancies.describe()

             caseid     nbrnaliv      babysex  birthwgt_lb  birthwgt_oz     prglength        outcome     birthord        agepreg       finalwgt
count  13593.000000  9148.000000  9144.000000  9144.000000  9087.000000  13593.000000  13593.000000  9148.000000  13241.000000   13593.000000
mean    6216.526595     1.025907     1.494532     6.653762     7.403874     29.531229      1.763996     1.824552     24.230949    8196.422280
std     3645.417341     0.252864     0.515295     1.588809     8.097454     13.802523      1.315930     1.037053      5.824302    9325.918114
min        1.000000     1.000000     1.000000     0.000000     0.000000      0.000000      1.000000     0.000000     10.000000     118.656790
25%     3022.000000     1.000000     1.000000     6.000000     3.000000     13.000000      1.000000     1.000000     20.000000    3841.375308
50%     6161.000000     1.000000     1.000000     7.000000     7.000000     39.000000      1.000000     2.000000     23.000000    6256.592133
75%     9423.000000     1.000000     2.000000     8.000000    11.000000     39.000000      2.000000     2.000000     28.000000    9432.360931
max    12571.000000     9.000000     9.000000     9.000000    99.000000     50.000000      6.000000     9.000000     44.000000  261879.953864

# Keep only the pregnancies that ended in a live birth (outcome code 1)
live_births = pregnancies[pregnancies['outcome'] == 1]
print("Number of live births is: " + str(live_births.shape[0]))
# Mean pregnancy length in weeks for first babies (birthord code 1)
mean_first = live_births[live_births['birthord'] == 1]['prglength'].mean()
# Mean pregnancy length in weeks for all other live births
mean_other = live_births[live_births['birthord'] != 1]['prglength'].mean()
print("Mean Pregnancy length for live births of first babies is: " + str(mean_first))
print("Mean Pregnancy length for live births of other babies is: " + str(mean_other))
print("Difference in Mean Pregnancy length for first and other babies is : " + str(mean_first - mean_other))

Number of live births is: 9148
Mean Pregnancy length for live births of first babies is: 38.6009517335
Mean Pregnancy length for live births of other babies is: 38.5229144667
Difference in Mean Pregnancy length for first and other babies is : 0.0780372667775

1.5 Significance

From the above analysis, the mean pregnancy length for first babies is about 0.078 weeks, or roughly 13.11 hours, longer than for other babies. A difference like this is called an apparent effect: there might be something going on, but we are not yet sure. An apparent effect is statistically significant if it is unlikely to have occurred by chance; if a difference of this size could plausibly arise by chance alone, we conclude that the effect is not statistically significant. An apparent effect that is caused by bias, measurement error, or some other kind of error is called an artifact.
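
The conversion to hours, and the idea of a difference "occurring by chance", can be sketched in a few lines. The first part reproduces the 13.11-hour figure from the difference printed in Exercise 1.3. The second part is not from the original post: it is one common way to estimate how often chance alone produces a difference at least as large, by shuffling the group labels (a permutation test). The helper permutation_p_value and the two small lists of pregnancy lengths are hypothetical and made up for illustration; they are not NSFG data.

```python
import random

# Part 1: convert the reported difference from weeks to hours.
HOURS_PER_WEEK = 7 * 24
diff_weeks = 0.0780372667775      # difference in means printed in Exercise 1.3
diff_hours = diff_weeks * HOURS_PER_WEEK
print(f"Difference: {diff_hours:.2f} hours")


# Part 2: a rough permutation test (an assumption, not the post's method).
def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, iterations=1000, seed=0):
    """Return the fraction of random relabelings whose absolute difference
    in means is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(iterations):
        rng.shuffle(pooled)   # pretend the group labels carry no information
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    return count / iterations


# Made-up pregnancy lengths in weeks, NOT taken from the NSFG.
first_demo = [39, 40, 38, 41, 39, 40]
other_demo = [39, 38, 39, 40, 38, 39]
p_value = permutation_p_value(first_demo, other_demo)
print(f"Share of shuffles with a difference at least as large: {p_value:.3f}")
```

A large share means that shuffled labels often produce a difference as big as the observed one, so the apparent effect could plausibly be due to chance. A small share suggests the effect is less likely to be a fluke. Running the same function on the real first and other pregnancy lengths would be the natural next step.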