Predicting the Lengths of Gestation for the Most Recent Births from Odds, Odds Ratios and Dummy Variable Regression Model

Research Article

  • Uchechukwu Marius Okeh *

International Institute for Nuclear Medicine and Allied Health Research, David Umahi Federal University of Health Sciences Uburu Ebonyi State, Nigeria.

*Corresponding Author: Uchechukwu Marius Okeh, International Institute for Nuclear Medicine and Allied Health Research, David Umahi Federal University of Health Sciences Uburu Ebonyi State, Nigeria.

Citation: Uchechukwu M. Okeh. (2025). Predicting the Lengths of Gestation for the Most Recent Births from Odds, Odds Ratios and Dummy Variable Regression Model, Journal of Women Health Care and Gynecology, BioRes Scientia Publishers. 5(6):1-8. DOI: 10.59657/2993-0871.brs.25.099

Copyright: © 2025 Uchechukwu Marius Okeh, this is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Received: July 07, 2025 | Accepted: October 10, 2025 | Published: October 17, 2025

Abstract

Background: One important advantage of dummy variable regression over ordinary regression is that, instead of accommodating only quantitative response and explanatory variables, it allows qualitative explanatory variables to be incorporated into a linear model. A dummy-variable regressor is coded to represent a dichotomous outcome. This paper proposes a method of estimating probabilities, odds and odds ratios of dichotomized outcomes using dummy variable regression.

Methodology: Dummy variable regression in which both the dependent and independent variables are binary is used to estimate the probabilities, odds and odds ratios of dichotomous outcomes. To do this, we first partition each of the parent independent variables into a set of mutually exclusive categories or subgroups and then use dummy variables to represent these categories in a regression model. In such a regression model, each parent independent variable is represented by dummy variables of 1s and 0s numbering one less than its categories. Any level of a parent independent variable that is not specifically represented by a dummy variable is referred to as the excluded level of that parent variable, while the others are termed the included levels in the regression model. Data were collected retrospectively and analysed using SPSS version 14.

Results: We illustrate the proposed method using the data to obtain the probabilities, odds and odds ratios of dichotomous outcomes. Among others, we estimate the probability that a randomly selected mother has a gestation period of a certain number of weeks for her last birth, the odds that a randomly selected mother from the population has a gestation period of some stated number of weeks, and the odds that the randomly selected mother has a male birth with a gestation period of some stated number of weeks.

Conclusions: We conclude that dummy variable regression, like logistic regression, enables one to estimate the probabilities, odds and odds ratios of dichotomous outcomes.


Keywords: odds; odds ratio; dichotomous response; dummy variable regression; probabilities

Introduction

Several researchers use multiple regression analysis and ANOVA in their statistical analyses, and multiple regression is applied in many fields. Oberkirchner et al. (2010) used multiple regression to analyse and model the effect of essential material and process parameters on the weight and moisture content of impregnated papers. Bajpai (2013) analysed a university model using multiple regression and ANCOVA and found them essential. Syla (2013) studied the significance of active-employment programs for employment levels using multiple regression. Everarda et al. (2005) used multiple regression results to study the importance of engaging students with a graphical user interface in teaching statistics courses. Oswald (2012) showed how viewing multiple regression results through multiple lenses can give researchers a better assessment. Kelley et al. (2003) showed that in multiple regression obtaining accurate parameter estimates matters more than attaining statistical significance. Pazzani et al. (1981) showed how independent sign regression generates linear models that are almost as accurate as multiple regression. Ludlow (2014) studied suppressor variables and suppression effects in building regression models. Moya-Laraño et al. (2008) encouraged ecological researchers to use partial regression in their studies. Breheny et al. (2013) described the visreg package, a useful tool for visualizing the estimated relationship between explanatory variables and the outcome, which provides convenient support for regression models.

Often a candidate for an examination or job interview may wish to estimate the probability of success given some predisposing factors such as the number of hours studied per day or per week, the nature, type and duration of the examination, and the candidate's prior qualifications, age, gender, ethnic group, state of origin, etc. A clinician conducting a diagnostic test or drug trial for a certain condition may wish to know the odds that subjects or patients respond positive given characteristics such as age, gender, body weight, family history, etc. A gynecologist or pediatrician may wish to estimate the odds that a newborn baby is underweight or has a longer than normal gestation period given the mother's age, parity and body weight and the child's gender. In other words, the response to the condition of interest is dichotomous, assuming one of two possible values, and the predisposing factors are either categorical variables or can be subdivided into a number of mutually exclusive subgroups or classes. This enables the fitting of a multiple regression model in which both the dependent and independent variables are categorical, which is then used to estimate the odds of occurrence of the outcome of interest, as discussed below.

The Proposed Method

Suppose a researcher collects a random sample of n respondents, subjects or patients from a certain population to investigate the presence or absence of some condition. Let $y_i$ be the response of the ith subject to the condition under study in the presence of some predisposing factors or parent explanatory variables A, B, C, ... with a, b, c, ... levels respectively, for i = 1, 2, ..., n.

Let

$$y_i = \begin{cases} 1, & \text{if the ith subject responds positive to the condition} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

Interest is in representing each of the parent explanatory variables A, B, C, ... as dummy variables of 1s and 0s and using them in a multiple dummy variable regression model in which y is the dependent variable with two mutually exclusive outcomes. To do this, each of the parent independent or explanatory variables is represented by a set of dummy variables numbering one less than its categories or levels. This avoids linear dependence among the columns of the design matrix X of the regression model and hence ensures that X is of full column rank (Boyle, 1970; Nates and Wasserman, 1974; Oyeka, 1992).
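The coding rule described above can be sketched in a few lines of Python. The factor, its level names and the helper `dummy_code` are hypothetical and serve only to illustrate the idea of one dummy column per included level; they are not taken from the paper.

```python
def dummy_code(values, levels, excluded):
    """Return one 0/1 column per included level (every level except `excluded`)."""
    included = [lv for lv in levels if lv != excluded]
    return {lv: [1 if v == lv else 0 for v in values] for lv in included}

# Hypothetical factor A with a = 3 levels, observed on six subjects.
factor_a = ["low", "mid", "high", "mid", "low", "high"]
levels_a = ["low", "mid", "high"]

# a - 1 = 2 dummy columns; "high" is the excluded level.
columns = dummy_code(factor_a, levels_a, excluded="high")
print(columns)

# Why one level is dropped: if all three levels were coded, the dummies in
# every row would sum to 1, duplicating the intercept column of X and
# making X rank deficient.
all_columns = {lv: [1 if v == lv else 0 for v in factor_a] for lv in levels_a}
row_sums = [sum(col[i] for col in all_columns.values()) for i in range(len(factor_a))]
print(row_sums)
```

A subject in the excluded level is identified by zeros in all the dummy columns of that factor.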

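Thus, let

$$x_{ij;A} = \begin{cases} 1, & \text{if the ith subject is in the jth level of factor A} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

for j = 1, 2, ..., a-1, with the dummy variables $x_{il;B}$ (l = 1, 2, ..., b-1), $x_{is;C}$ (s = 1, 2, ..., c-1), ... of the other factors defined in the same way.

Following these specifications, we may now fit the multiple dummy variable regression model expressing the dependence of $y_i$ on the $x_{ij}$s as

$$y_i = \beta_0 + \sum_{j=1}^{a-1}\beta_{j;A}\,x_{ij;A} + \sum_{l=1}^{b-1}\beta_{l;B}\,x_{il;B} + \sum_{s=1}^{c-1}\beta_{s;C}\,x_{is;C} + \cdots + e_i \tag{3}$$

where the $\beta$s are regression parameters or coefficients and the $e_i$ are error terms uncorrelated with the $x_{ij}$s, with $E(e_i) = 0$, for i = 1, 2, ..., n. Note that Equation 3 may alternatively be represented in its matrix form as

$$y = X\beta + e \tag{4}$$

where $y$ is an n×1 column vector of 1s and 0s representing the n scores or responses of subjects to the condition of interest, $X$ is an n×r design matrix of 1s and 0s, $\beta$ is an r×1 column vector of regression parameters and $e$ is an n×1 column vector of error terms uncorrelated with $X$, with $E(e) = 0$, where r is the number of parameters (regression coefficients) in the model (Equation 3).

Applying the method of least squares to either Equation 3 or 4 yields an unbiased estimator of $\beta$ as

$$b = \hat{\beta} = (X'X)^{-1}X'y \tag{5}$$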

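The following analysis of variance (ANOVA) table is used to test the adequacy of Equation 3 or 4 based on the F-test statistic.

| Source of Variation | Sum of Squares (SS) | Degrees of Freedom (DF) | Mean Sum of Squares (MS) | F-Ratio |
|---|---|---|---|---|
| Regression | $SSR = b'X'y - n\bar{y}^2$ | r-1 | $MSR = SSR/(r-1)$ | $F = MSR/MSE$ |
| Error | $SSE = y'y - b'X'y$ | n-r | $MSE = SSE/(n-r)$ | |
| Total | $SST = y'y - n\bar{y}^2$ | n-1 | | |

Table 1: ANOVA Table for Equation 4.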

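The null hypothesis to be tested for the adequacy of Equation 4 using the results of Table 1 is

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{r-1} = 0 \quad \text{versus} \quad H_1: \beta_k \neq 0 \text{ for at least one } k \tag{6}$$

where $\beta_1, \ldots, \beta_{r-1}$ denote the r-1 regression coefficients other than the intercept. $H_0$ is rejected at the $\alpha$ level of significance if

$$F \geq F_{1-\alpha;\, r-1,\, n-r} \tag{7}$$

Otherwise $H_0$ is accepted, where $F_{1-\alpha;\, r-1,\, n-r}$ is the critical value of the F distribution with r-1 and n-r degrees of freedom for a specified level of significance $\alpha$. If the model fits, that is, if $H_0$ is rejected so that not all the $\beta_k$ are zero, then we may proceed to estimate the required probabilities and odds of positive responses to the condition of interest.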
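As a rough illustration of the least squares fit and the ANOVA F-ratio described above, the following sketch uses NumPy on a small made-up data set (12 subjects, one three-level factor coded by two dummies). The data and variable names are assumptions for illustration only and do not come from the paper.

```python
import numpy as np

# Hypothetical data: binary response y and one three-level factor coded by
# two dummies (the third level is excluded), so r = 3 parameters.
y = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0], dtype=float)
x1 = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
x2 = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
X = np.column_stack([np.ones_like(y), x1, x2])
n, r = X.shape

# Least squares estimate b = (X'X)^{-1} X'y, computed via lstsq.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# ANOVA decomposition: total = regression + error.
fitted = X @ b
ss_total = float(np.sum((y - y.mean()) ** 2))
ss_error = float(np.sum((y - fitted) ** 2))
ss_regression = ss_total - ss_error

ms_regression = ss_regression / (r - 1)
ms_error = ss_error / (n - r)
f_ratio = ms_regression / ms_error

print("b =", np.round(b, 3))
print("F =", round(f_ratio, 3), "on", r - 1, "and", n - r, "degrees of freedom")
```

Here the intercept equals the proportion of positive responses in the excluded level, and each slope is the difference between an included level's proportion and that of the excluded level. The F-ratio would then be compared with a tabulated critical value or, if SciPy is available, with `scipy.stats.f.ppf(1 - alpha, r - 1, n - r)`.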

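Now from Equation 4 we have that the expected value of y is

$$E(y) = X\beta \tag{8}$$

or equivalently, from Equation 3,

$$P_i = E(y_i) = \beta_0 + \sum_{j=1}^{a-1}\beta_{j;A}\,x_{ij;A} + \sum_{l=1}^{b-1}\beta_{l;B}\,x_{il;B} + \sum_{s=1}^{c-1}\beta_{s;C}\,x_{is;C} + \cdots \tag{9}$$

where $x_{ij;A}$ is the dummy variable for the jth level of factor A (and similarly for factors B, C, ...) and the $\beta$s are the corresponding coefficients. This is the expected proportion of positive responses, or the probability that the ith subject responds positive (1) to the condition of interest. In practice the $\beta$s are replaced by their least squares estimates.

The expected probability that the ith subject responds negative (0) is

$$1 - P_i = 1 - E(y_i) \tag{10}$$

Hence the odds that the ith randomly selected subject responds positive to the condition under study is

$$\Omega_i = \frac{P_i}{1 - P_i} \tag{11}$$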

In particular, interest may be in some specific level of a parent explanatory variable, such as the jth level of factor A. Then to find the probability that the ith randomly selected subject in the jth level of factor A responds positive to the condition, we set the dummy variable for the jth level of factor A equal to 1 and all other dummy variables equal to 0 in Equation 9, yielding

$$P_{j;A} = \beta_0 + \beta_{j;A} \tag{12}$$

for j = 1, 2, ..., a-1, where $\beta_0$ is the intercept and $\beta_{j;A}$ is the coefficient of the dummy variable for the jth level of factor A.

This is the probability that a subject in the jth level of factor A together with the omitted levels of all the other factors in the model (the levels omitted in the specifications of Equation 2) responds positive to the condition under study. Similarly, the probability that this subject (in the jth level of factor A and the omitted levels of the other factors) responds negative is

$$1 - P_{j;A} = 1 - \beta_0 - \beta_{j;A} \tag{13}$$

Hence the odds that the ith randomly selected subject in the jth level of factor A and the omitted levels of the other factors responds positive to the condition under study is

$$\Omega_{j;A} = \frac{\beta_0 + \beta_{j;A}}{1 - \beta_0 - \beta_{j;A}} \tag{14}$$

Note that Equations 12, 13 and 14 are obtained from Equations 9, 10 and 11 respectively by setting the dummy variables as described.

Equations 12 and 13 are respectively the probability that a randomly selected subject responds positive and the probability that the subject responds negative to the condition under study, while Equation 14 is the odds of positive response. In general, if interest is in determining the odds of positive response by a randomly selected subject in the jth level of factor A, lth level of factor B, sth level of factor C and omitted levels of other factors in the model, then

$$\Omega_{jls} = \frac{\beta_0 + \beta_{j;A} + \beta_{l;B} + \beta_{s;C}}{1 - (\beta_0 + \beta_{j;A} + \beta_{l;B} + \beta_{s;C})} \tag{15}$$

Finally, the odds ratio of positive response, or of experiencing the condition, by randomly selected subjects in the jth and kth levels of factor A and omitted levels of other factors in the model is

$$\frac{\Omega_{j;A}}{\Omega_{k;A}} = \frac{(\beta_0 + \beta_{j;A})(1 - \beta_0 - \beta_{k;A})}{(\beta_0 + \beta_{k;A})(1 - \beta_0 - \beta_{j;A})} \tag{16}$$
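The passage from fitted coefficients to probabilities, odds and odds ratios can be sketched as follows. The coefficient values and the names `A1`, `A2`, `B1` are hypothetical, chosen only so that the arithmetic is easy to follow; a subject's fitted probability is the intercept plus the coefficients of the dummies set to 1.

```python
def probability(coef, active):
    """Fitted P(y = 1): intercept plus the coefficients of the dummies set to 1."""
    return coef["intercept"] + sum(coef[name] for name in active)

def odds(p):
    """Odds of a positive response, P / (1 - P)."""
    if not 0 < p < 1:
        # A linear probability model can produce fitted values outside (0, 1),
        # in which case the odds are not defined.
        raise ValueError("fitted probability must lie strictly between 0 and 1")
    return p / (1 - p)

# Hypothetical estimates: factor A has levels 1 and 2 included (level 3
# excluded) and factor B has level 1 included.
coef = {"intercept": 0.40, "A1": 0.10, "A2": -0.15, "B1": 0.05}

p_a1 = probability(coef, ["A1"])      # level 1 of A, omitted levels elsewhere
p_a2 = probability(coef, ["A2"])      # level 2 of A, omitted levels elsewhere
odds_a1 = odds(p_a1)
odds_a2 = odds(p_a2)
odds_ratio = odds_a1 / odds_a2        # level 1 versus level 2 of factor A

# Several factors at once: level 1 of A together with level 1 of B.
odds_a1_b1 = odds(probability(coef, ["A1", "B1"]))

print(round(p_a1, 3), round(odds_a1, 3), round(odds_ratio, 3), round(odds_a1_b1, 3))
```

The guard in `odds` reflects a practical point: nothing in the least squares fit forces the fitted probabilities to stay between 0 and 1, so it is worth checking them before forming odds.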

Results

We here illustrate the present method using data on the lengths of gestation (in weeks) for the most recent births of a random sample of n = 41 women, by age and parity of mother and gender of the last birth (Table 2).

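Table 2: Data on Lengths of Gestation for Last Births by Maternal Age, Parity and Gender of Last Birth.

| S/N | Mother's Age | Parity | Gender of Last Birth | Length of Gestation for Last Birth |
|---|---|---|---|---|
| 1 | 28 | 3 | F | 40 |
| 2 | 36 | 8 | M | 34+1 |
| 3 | 30 | 1 | F | 39+2 |
| 4 | 25 | 3 | F | 41+5 |
| 5 | 27 | 1 | M | 40 |
| 6 | 30 | 1 | M | 38+6 |
| 7 | 27 | 3 | M | 40 |
| 8 | 20 | 1 | M | 40+5 |
| 9 | 31 | 6 | M | 39 |
| 10 | 31 | 6 | F | 39 |
| 11 | 27 | 1 | M | 40 |
| 12 | 19 | 0 | M | 38+2 |
| 13 | 30 | 3 | F | 41 |
| 14 | 30 | 5 | M | 39 |
| 15 | 39 | 2 | M | 41+5 |
| 16 | 25 | 1 | F | 40+5 |
| 17 | 29 | 4 | M | 40+3 |
| 18 | 23 | 2 | M | 39 |
| 19 | 30 | 2 | M | 37+5 |
| 20 | 28 | 0 | M | 40+4 |
| 21 | 24 | 0 | M | 37+5 |
| 22 | 20 | 1 | M | 40+4 |
| 23 | 30 | 6 | M | 42+3 |
| 24 | 32 | 4 | F | 41+2 |
| 25 | 22 | 1 | F | 37+3 |
| 26 | 25 | 0 | F | 38+5 |
| 27 | 22 | 0 | F | 39+3 |
| 28 | 33 | 4 | F | 39 |
| 29 | 29 | 1 | M | 40+1 |
| 30 | 29 | 3 | F | 39+5 |
| 31 | 25 | 0 | M | 37 |
| 32 | 28 | 2 | F | 38+3 |
| 33 | 26 | 1 | F | 40 |
| 34 | 28 | 4 | M | 37+4 |
| 35 | 35 | 2 | M | 39+2 |
| 36 | 25 | 0 | M | 40+3 |
| 37 | 34 | 0 | F | 40 |
| 38 | 26 | 0 | F | 36 |
| 39 | 30 | 6 | M | 42+3 |
| 40 | 32 | 7 | F | 40+2 |
| 41 | 25 | 6 | M | 38 |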
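A short sketch of how the gestation lengths in Table 2 might be turned into a binary response is given below. It assumes that an entry such as 39+2 means 39 completed weeks plus 2 days and that a positive response is a gestation of more than 39.5 weeks; the function name `to_weeks` is illustrative only.

```python
# Gestation lengths from Table 2, in order of serial number (S/N 1 to 41).
gestation = (
    "40 34+1 39+2 41+5 40 38+6 40 40+5 39 39 40 38+2 41 39 41+5 40+5 40+3 "
    "39 37+5 40+4 37+5 40+4 42+3 41+2 37+3 38+5 39+3 39 40+1 39+5 37 38+3 "
    "40 37+4 39+2 40+3 40 36 42+3 40+2 38"
).split()

def to_weeks(entry):
    """Convert 'w+d' (assumed to mean weeks plus days) to decimal weeks."""
    weeks, _, days = entry.partition("+")
    return int(weeks) + (int(days) / 7 if days else 0.0)

# Dichotomize: 1 if gestation exceeds 39.5 weeks, 0 otherwise.
y = [1 if to_weeks(g) > 39.5 else 0 for g in gestation]

n = len(y)
positives = sum(y)
mean_y = positives / n
ss_total = sum((v - mean_y) ** 2 for v in y)

print(n, positives, round(ss_total, 3))
```

Under these assumptions 21 of the 41 births are coded 1, and the total sum of squares of the binary response is about 10.244, which agrees with the total sum of squares reported in Table 3.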

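Using length of gestation for last birth as the dependent variable and mother's age, parity and gender of last birth as the independent variables, we may proceed as follows. Let

$$y_i = \begin{cases} 1, & \text{if the length of gestation for the last birth of the ith mother is over 39.5 weeks} \\ 0, & \text{otherwise} \end{cases}$$

$$\begin{aligned}
x_{i1;A} &= 1 \text{ if the ith mother is aged 25 years or less, and 0 otherwise;} \\
x_{i2;A} &= 1 \text{ if the ith mother is aged 26 to 29 years, and 0 otherwise;} \\
x_{i1;B} &= 1 \text{ if the ith mother has a parity of at most one child, and 0 otherwise;} \\
x_{i2;B} &= 1 \text{ if the ith mother has a parity of two or three children, and 0 otherwise;} \\
x_{i1;C} &= 1 \text{ if the last birth of the ith mother is male, and 0 otherwise}
\end{aligned} \tag{17}$$

so that mothers aged 30 years or more, a parity of more than three children and female births are the excluded levels.

The resulting dummy variable multiple regression model is then

$$y_i = \beta_0 + \beta_{1;A}\,x_{i1;A} + \beta_{2;A}\,x_{i2;A} + \beta_{1;B}\,x_{i1;B} + \beta_{2;B}\,x_{i2;B} + \beta_{1;C}\,x_{i1;C} + e_i \tag{18}$$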

The regression coefficients of Equation 18 are estimated from Equation 5 yielding the predicted regression model

The corresponding analysis of variance table is presented in Table 3.

Table 3: ANOVA Table for Equation 19.

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Sum of Squares | F-Ratio |
|---|---|---|---|---|
| Regression | 0.672 | 5 | 0.134 | 0.491 |
| Error | 9.572 | 35 | 0.273 | |
| Total | 10.244 | 40 | | |

The present model explains only about 6.6% of the total variation in length of gestation, and the F-ratio of 0.491 on 5 and 35 degrees of freedom is far below conventional critical values, so the null hypothesis of Equation 6 is not rejected.
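The two summary figures quoted here can be checked directly from the entries printed in Table 3. The sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Entries as printed in Table 3.
ss_error, df_error = 9.572, 35
ss_total, df_total = 10.244, 40
ms_regression, ms_error = 0.134, 0.273

# F-ratio: mean square for regression over mean square for error.
f_ratio = ms_regression / ms_error

# Proportion of total variation explained: 1 - SSE / SST.
r_squared = 1 - ss_error / ss_total

print(round(f_ratio, 3), round(100 * r_squared, 1))
```

Both values reproduce the reported F-ratio of 0.491 and the figure of about 6.6%. For comparison, the tabulated 5% critical value of F with 5 and 35 degrees of freedom is roughly 2.5, well above the observed ratio.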

Now the finding of no association between length of gestation and the independent variables age, parity and gender of last birth, that is, the acceptance of H0, would ordinarily signal the end of statistical analysis. However, we here present, for illustration purposes only, the calculation of the probabilities, odds and odds ratios of occurrence of the condition under study, namely that a randomly selected mother has a gestation period of more than 39.5 weeks for her last birth.

The probability that the ith randomly selected mother has a gestation period of over 39.5 weeks for her last birth is estimated from Equation 19, and the probability that her gestation period lasted 39.5 weeks or less is obtained using Equation 19 in Equation 10.

Hence, the odds that a randomly selected mother from the population has a gestation period of more than 39.5 weeks is estimated from equations 11 and 19 to 20 as

In particular, the odds that the randomly selected mother has a male birth with a gestation period of more than 39.5 weeks is obtained using Equation 21 in Equation 14 by setting the dummy variable for a male birth to 1 and all other dummy variables to 0 in Equation 21, yielding an estimated odds of about 0.636.

This means that for every one thousand male births with a gestation period equal to or less than 39.5 weeks, 636 have a gestation period of over 39.5 weeks. The odds that the last female birth of a randomly selected mother has a gestation period of over 39.5 weeks is obtained by setting all the dummy variables to 0 in Equations 19 and 20 and taking the ratio, yielding about 0.639.

This means that for every one thousand female births with a gestation period of at most 39.5 weeks, 639 have a gestation period of over 39.5 weeks.

The odds ratio is therefore about 0.636/0.639 ≈ 0.995.

In other words, for every 1000 female births with a gestation period of more than 39.5 weeks, there are about 995 male births with a gestation period of more than 39.5 weeks. Note that the estimated regression coefficient for mothers aged 25 years or less means that, if parity and gender of last birth are held constant, the probability that the length of gestation for the birth by a randomly selected mother is over 39.5 weeks is expected to be lower on average by 7.4 percentage points if the woman is aged 25 years or less than if she belongs to any other age group. Similarly, the estimated coefficient for a parity of two or three children implies that, if age of mother and gender of child are held constant, the probability that the length of gestation of her last birth exceeds 39.5 weeks is on average 14.5 percentage points higher for a randomly selected mother with a parity of two or three children than for other mothers.
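The odds quoted above can be converted back to probabilities and combined into the odds ratio with a few lines of arithmetic. This is only a check on the reported figures (636 and 639 per thousand), with illustrative variable names.

```python
def odds_to_probability(odds):
    """Invert odds = P / (1 - P) to recover P."""
    return odds / (1 + odds)

odds_male = 0.636     # 636 per thousand, as reported for male births
odds_female = 0.639   # 639 per thousand, as reported for female births

odds_ratio = odds_male / odds_female
p_male = odds_to_probability(odds_male)
p_female = odds_to_probability(odds_female)

print(round(odds_ratio, 3), round(p_male, 3), round(p_female, 3))
```

The implied probabilities differ by only about one tenth of a percentage point, which is why the odds ratio for male versus female births is so close to 1.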

However, interpreting these estimated regression coefficients in terms of absolute probabilities and odds would seem more illuminating. Thus, the probability that the most recent male birth by a randomly selected woman aged 30 years or more with more than three births has a gestation period of over 39.5 weeks is obtained, as in Equation 2, by setting the corresponding dummy variables in Equation 19, yielding

The probability that the length of gestation for the male birth is equal to or less than 39.5 weeks (see Equation 15) is therefore

Hence the corresponding odds for this event is estimated as about 0.637.

This means that for every 1000 most recent male births with a gestation period of at most 39.5 weeks by women aged 30 years or more with more than three children, we would expect about 637 such births to have a gestation period of over 39.5 weeks. Also, the probability that the most recent female birth by a randomly selected mother aged 30 years or more with more than three children has a gestation period of over 39.5 weeks is obtained by setting all the dummy variables to 0 in Equation 19, yielding

The complementary probability is

The corresponding odds is estimated as about 0.639.

This means that for every 1000 most recent female births by mothers aged 30 years or more with more than three children with a length of gestation of at most 39.5 weeks, 639 such female births are expected to have a length of gestation of over 39.5 weeks. Thus, the estimated odds ratio is about 0.637/0.639 ≈ 0.997.

This means that for every 1000 most recent female births with a length of gestation of over 39.5 weeks we would expect about 997 most recent male births by mothers aged 30 years or over with more than three children to also exceed a length of gestation of 39.5 weeks. The probability that a randomly selected mother aged 25 years or less with a parity of at most one child has a male birth after 39.5 weeks of gestation is, from Equation 19, about 0.375.

The complementary probability is about 0.625.

In other words, the most recent birth by a randomly selected mother aged 25 years or less with a parity of at most one child, if male, has a probability of 37.5 percent of being born after and a probability of 62.5 percent of being born at or before a gestation period of 39.5 weeks. The corresponding odds is 0.375/0.625 = 0.600.

That is, for every 1000 most recent male births by mothers aged at most 25 years with not more than one child following a gestation period of at most 39.5 weeks, there are 600 male births by these women with a gestation period of over 39.5 weeks. If the most recent birth by the randomly selected mother is female, we set the dummy variable for a male birth to 0 in Equation 19 and hence in Equation 21 to obtain the odds that the most recent female birth by the randomly selected mother has a gestation period of over 39.5 weeks as

Therefore, the corresponding odds ratio for this event is about 0.995.

This means that for every 1000 female births with a gestation period of over 39.5 weeks, we would expect 995 male births to also have a gestation period of over 39.5 weeks among births to mothers aged at most 25 years with not more than one child.
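The step from the quoted probabilities (37.5 percent and 62.5 percent) to odds expressed per thousand can be written out as a short calculation. The variable names are illustrative.

```python
p_over = 0.375            # probability of a gestation period over 39.5 weeks
p_not_over = 1 - p_over   # complementary probability, 0.625

# Odds of a gestation period over 39.5 weeks.
odds_over = p_over / p_not_over

# The same odds expressed as births per 1000 births at or below 39.5 weeks.
per_thousand = round(1000 * odds_over)

print(odds_over, per_thousand)
```

With these probabilities the odds are 0.6, that is, about 600 births with a gestation period of over 39.5 weeks for every 1000 births with a gestation period of at most 39.5 weeks.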

Finally, the odds that a randomly selected mother aged 30 years or more with 2 or 3 children has a male child after over 39.5 weeks of gestation in Equation 19 is from Equation 15

Similarly, the odds that the most recent female birth by this randomly selected mother has a gestation period of over 39.5 weeks in Equation 19 is

The corresponding odds ratio is

The odds that the length of gestation for the most recent birth by a randomly selected mother aged 30 years or more with a parity of over 3 children exceeds 39.5 weeks is about 0.637 if the child is male and about 0.639 if the child is female.

Hence, for the most recent male birth, the ratio of the odds that a randomly selected woman aged 30 years or more with a parity of 2 or 3 children has a gestation period of over 39.5 weeks to the odds that her counterpart with more than 3 children has a gestation period of over 39.5 weeks is, using Equation 16, about 1.799.

This means that for every 1000 most recent male births by mothers aged 30 years or more with more than 3 children born after over 39.5 weeks of gestation, there are about 1799 male births by their counterparts with a parity of two or three children born after over 39.5 weeks of gestation. The odds ratio for female births is about 1.801.

In other words, for every 1000 female births by mothers aged 30 years or more with a parity of more than 3 children born after over 39.5 weeks of gestation, there are about 1801 female births by their counterparts with a parity of 2 or 3 children born after over 39.5 weeks of gestation. Other probabilities, odds and odds ratios can similarly be estimated.
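As a consistency check, the odds ratios of about 1.8 can be approximately recovered from figures quoted earlier: the baseline odds of 0.637 (male) and 0.639 (female) for mothers aged 30 years or more with more than three children, and the 14.5 percentage point increase associated with a parity of two or three. The sketch assumes that these baseline groups correspond to the excluded levels of age and parity, so that only the parity coefficient is added; this is an inference from the reported numbers rather than something stated explicitly.

```python
def to_probability(odds):
    """Invert odds = P / (1 - P)."""
    return odds / (1 + odds)

def to_odds(p):
    """Odds of a positive response, P / (1 - P)."""
    return p / (1 - p)

parity_effect = 0.145  # reported coefficient for a parity of two or three
baseline_odds = {"male": 0.637, "female": 0.639}  # aged 30+, parity over 3

ratios = {}
for gender, base in baseline_odds.items():
    # Add the parity effect on the probability scale, then return to odds.
    p_parity_2_3 = to_probability(base) + parity_effect
    ratios[gender] = to_odds(p_parity_2_3) / base

for gender, ratio in ratios.items():
    print(gender, round(ratio, 2))
```

Both ratios come out close to 1.80, in line with the reported 1.799 and 1.801; the small differences are what one would expect from rounding in the published odds and coefficient.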

Conclusion

We have in this paper developed a method of estimating the probabilities, odds and odds ratios of the occurrence of positive responses in dichotomized data using multiple dummy variable regression in which both the dependent and independent variables are binary. This approach enables the interpretation of the resulting partial regression coefficients as probabilities. The proposed method does not require the often-restrictive assumptions of normality and homogeneity usually necessary when the variables used in a regression model are assumed to be continuous. The method is illustrated with sample data in which length of gestation for last birth is regressed on maternal age, parity and gender of last birth.

References