I want to run a Lasso regression analysis to select the subset of predictors that minimizes prediction error for my quantitative response variable: student college expectations. I included several categorical explanatory variables and one quantitative explanatory variable in my model. Here is the syntax and output:

The graph and chart show that by far the most important predictor of student expectations is maternal expectations. The best model includes the variables mom college expectations, Hispanic ethnicity, mom talked school other, and mom worked on school project. Being Hispanic or biracial appears to be negatively associated with college expectations, while the other variables are positively associated.

As in the previous assignment, my target variable is student college expectations. I binned student college expectations to make a two-level variable where low expectations=2 and high expectations=1. I want to know more about the effect of several variables, all categorical, on this target variable: ethnicity, mom talked school grades, mom talked school other, mom worked school project, and mom's college expectations. I ran the PROC HPFOREST step. Here is my code and output:

There is a misclassification rate of 23.9%, meaning that the random forest correctly classifies observations about 76% of the time. The output also shows that mom's college expectations (H1WP11), mom talked school other (H1WP17J), being Hispanic (H1G14), and mom talked grades (H1WP17H) all help predict students' college expectations. Conversely, whether the subject is biracial, white, or Native American is less important in predicting students' college expectations.
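As a quick check on the arithmetic, accuracy is just one minus the misclassification rate. Here is a minimal Python sketch; the variable names are my own and are not part of the SAS output.

```python
# Misclassification rate reported in the random forest output above.
misclassification_rate = 0.239

# Accuracy is the complement of the misclassification rate.
accuracy = 1 - misclassification_rate

print(f"Accuracy: {accuracy:.1%}")  # prints "Accuracy: 76.1%"
```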

Since the assignment asked us to run a classification tree, I started by binning my categorical response variable into two groups using the variable "highly likely" (college expectation levels 1-3 were coded to 2, meaning no, and levels 4-5 were coded to 1, meaning yes).

Then I ran the PROC HPSPLIT step using a random seed number. I included the following 2-level explanatory variables in my model: various ethnicities, mom talked school other, mom worked school project, mom talked grades. Here is the code and output:

The decision tree shows the highest likelihood of low college expectations for individuals whose parents did not talk to them about school and who are Hispanic (46%) or biracial (32%). The lowest likelihood of low college expectations was for individuals whose parents did talk to them about school other (18%) and for those who were not Hispanic and whose parents did talk to them about grades (24%).

After pruning, the decision tree has 12 leaves and predicts the value 1, meaning yes, for the target variable highly likely to attend college.

This plot shows the cross-validated average squared error (ASE) by number of leaves for each tree. The tree with the lowest ASE has 13 leaves.

The model correctly classifies 99.8% of those who think they are highly likely to go to college, but only slightly more than 1 percent of those who are not highly likely to go to college.

This chart helps explain how SAS selected which variables would be included in the model. The importance scores indicate how important each variable is for predicting the outcome of the target variable. With similar groups of variables, like ethnicity, only some were included.
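To make those two percentages concrete, here is a small Python sketch of how they come out of a confusion matrix. The counts are hypothetical numbers I chose only to mirror the percentages above; they are not the counts from my SAS output.

```python
# Hypothetical counts for illustration only (NOT my actual SAS results).
actual_high = 5000          # students who are highly likely to attend college
correct_high = 4990         # of those, how many the tree classified correctly
actual_low = 1500           # students who are not highly likely to attend
correct_low = 18            # of those, how many the tree classified correctly

# Share of each group that the tree classifies correctly.
rate_high = correct_high / actual_high
rate_low = correct_low / actual_low

# Overall accuracy mixes both groups, weighted by group size.
overall_accuracy = (correct_high + correct_low) / (actual_high + actual_low)

print(f"High group: {rate_high:.1%}")
print(f"Low group: {rate_low:.1%}")
print(f"Overall: {overall_accuracy:.1%}")
```

With counts like these, a tree that predicts "yes" for almost everyone can still look reasonably accurate overall, because the high-expectation group is much larger.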

Running the model with a couple different random seed numbers yielded the same result.

For this assignment, I'm going to consider the relationships between my 3 maternal at-home involvement strategies, all binary categorical variables.

The logistic regression output shows that there is a significant relationship between mom talked grades and mom worked on a school project (p<.0001). The odds ratio is 6.662 (95% CI 5.37-8.36), meaning that the odds of having worked on a school project are about 6.7 times higher for a mom who talked about grades.

I want to test for possible confounders. After controlling for mom talked school other, I found that the odds of working on a school project are around 5.85 times higher for a mother who talked grades (95% CI 4.6-7.4). Mom talked grades is still significantly related to mom worked on school project, suggesting mom talked school other is not a confounder. The confidence intervals for the 2 variables do not overlap, meaning I can also conclude that mom talked grades is more closely related to working on a school project than mom talked school other.
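For anyone curious how an odds ratio and its confidence interval are computed, here is a minimal Python sketch for a 2x2 table. It assumes a Wald interval on the log odds ratio, and the counts are hypothetical ones I made up for illustration, not my Add Health data. The function name `odds_ratio_with_ci` is also my own.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts for illustration only.
or_value, ci_lower, ci_upper = odds_ratio_with_ci(a=40, b=60, c=10, d=90)
print(f"OR = {or_value:.2f}, 95% CI {ci_lower:.2f}-{ci_upper:.2f}")
```

If the interval does not include 1, the association is significant at the 5% level.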

In the previous lesson, I tested the relationship between mom talked grades, a specific maternal at-home involvement strategy, and student college expectations using linear regression analysis and found a significant relationship (p<.0001). However, r-squared, representing the proportion of the variability in y explained by x, was very low at .00945 (less than 1%). Now, I want to add more variables to my model to better explain the variability in y.

I want to test the effect of 2 other at-home involvement strategies, mom talked school other (H1WP17I) and mom helped with a school project (H1WP17J), both categorical variables with 2 levels that already include a 0 value. I also want to test the effect of mother's college expectations, considered a quantitative variable. Because this variable does not include a 0 value, I centered the data (H1WP11_c).

Here is the output:

I found that after controlling for the influence of the other variables, student college expectations are significantly, positively related to mom talked school other (p=0.03, beta=0.9), mom worked on school project (p=.0001, beta=.16) and mom's college expectations (p<.0001, beta=0.3). However, my original variable, mom talked grades, no longer shows a significant relationship to student college expectations (p=0.63).
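Centering just means subtracting the mean from every value so that 0 corresponds to the average response. Here is a minimal Python sketch; the response values are hypothetical and only the variable names follow my dataset.

```python
from statistics import mean

# Hypothetical responses on the 1-5 maternal expectation scale (illustration only).
h1wp11 = [1, 3, 4, 5, 5, 4, 2, 5]

# Subtract the sample mean so the centered variable has a mean of 0.
h1wp11_mean = mean(h1wp11)
h1wp11_c = [value - h1wp11_mean for value in h1wp11]

print(h1wp11_mean)
print(h1wp11_c)
```

After centering, the intercept of the regression is the predicted response at the average level of maternal expectations rather than at an impossible value of 0.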

*Confounders*

I needed to check whether any of the added variables is a confounder of the association between mom talked grades and student college expectations. I added each variable to my PROC GLM step, one by one. Each time, I found that mom talked grades had a significant p-value. Again, when I added both mom talked school other and mom worked project, and both mom worked project and mom's college expectations to the model, the p-value for mom talked grades was still significant. However, when I added mom's college expectations and mom talked school other, mom talked grades (p=.48) was not significant. Therefore, when controlling for the effects of both mom's college expectations and mom talked school other, mom talked grades does not have a significant relationship with student college expectations.

I want to learn more about the relationship between student college expectations and maternal college expectations. First, I used the GLM procedure to test the linear relationship between the two. The significant p-value and positive parameter estimate show that maternal college expectations are positively associated with student college expectations.

Next, I tried a quadratic model. The parameter estimate for H1WP11 squared was negative, suggesting that the fit line is concave: it curves upward at first and then flattens or curves down again. The r-squared value also increased from about 12% to about 13%.
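To see what a negative squared term implies for the shape of the fit line, here is a small Python sketch. The coefficients are hypothetical values I made up for illustration; they are not the estimates from my PROC GLM output.

```python
# Hypothetical coefficients for illustration only (NOT my actual estimates).
b0, b1, b2 = 4.2, 0.3, -0.05  # intercept, linear term, squared term

def predict(x):
    """Predicted student expectation for a maternal expectation value x."""
    return b0 + b1 * x + b2 * x ** 2

# With a negative squared term the curve is concave and peaks at -b1 / (2 * b2).
peak_x = -b1 / (2 * b2)

for x in (-2, 0, 2, 4):
    print(x, round(predict(x), 3))
print("peak at x =", round(peak_x, 3))
```

Whether the downward part of the curve actually shows up depends on whether the peak falls inside the observed range of the explanatory variable.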

Next, I want to evaluate the fit of this model.

I want to use a linear regression model to look at the relationship between a specific at-home involvement strategy, mom talked about grades (a categorical variable with 2 levels), and student college expectations (considered quantitative, scaled 1-5). My explanatory variable already includes a 0 value (indicating that the resident mother did not engage in the at-home involvement strategy).

The results of the linear regression model indicated that mom talking about grades (beta=0.229, p<.0001) was significantly and positively associated with student college expectations.

The regression equation would be y=4.023+.229x. That means that a person whose mother did not talk to him or her about grades would have a predicted college expectation of 4.023, while a person whose mother did talk to him or her about grades would have a predicted expectation of 4.252.

It's important to note, however, that despite the highly significant p-value, r-squared, representing the proportion of the variability in y explained by x, is very low at .00945 (less than 1%).
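The two predicted values from the regression equation above can be checked with a few lines of Python. The intercept and slope come from my output; the function name is just something I picked for the sketch.

```python
# Parameter estimates from the regression equation y = 4.023 + .229x.
intercept = 4.023
slope = 0.229

def predicted_expectation(mom_talked_grades):
    """Predicted college expectation; pass 0 (did not talk) or 1 (did talk)."""
    return intercept + slope * mom_talked_grades

did_not_talk = predicted_expectation(0)
did_talk = predicted_expectation(1)

print(round(did_not_talk, 3))  # 4.023
print(round(did_talk, 3))      # 4.252
```

Because the explanatory variable is coded 0/1, the slope is simply the difference between the two group predictions.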

The National Longitudinal Study of Adolescent to Adult Health (Add Health) is a longitudinal study that seeks to measure health changes of a sample group over time. Participants (N=90,118) are a representative sample from 80 high schools across the US who were adolescents in grades 7-12 during the 1994-1995 school year. The dataset includes respondents from different racial groups and across SES categories. Researchers have continued to observe the cohort into adulthood, conducting four sets of in-home interviews. The most recent interview was conducted in 2008, when the cohort was between 24 and 32 years old.

The data used for my study included respondents from the publicly available sample population who indicated that they had a resident mother and were currently enrolled in a school with a 7-12 grade system during the initial wave of data collection (N=6,504).

The data was collected in 5 waves. Wave I (1994-1995) involved in-school surveys with students and school administration, as well as in-home interviews with students and parents. In-school surveys were self-administered. About 200 students were selected to create a representative sample from each of the 80 high schools for the in-home interview. Questions were either read by an interviewer or heard through headphones, depending on the sensitivity of the topic. Responses were recorded on a laptop computer. Parents were also asked to complete an interviewer-assisted questionnaire. The questions in Wave I are primarily concerned with how various factors (social, behavioral, and environmental) are related to adolescent health.

Wave II (1996) included in-school surveys of school administration as well as in-home follow-ups with the previous cohort. Wave III (2001-2002) included in-home interviews with cohort members' romantic partners, as well as a follow-up interview. Wave IV (2007-2008) included a follow-up in-home interview with respondents and incorporated many new topics given new research needs. Wave V will be conducted from 2016-2018, with an expanded focus on social, behavioral and biological linkages to the cohort members' health trajectories, particularly the development of chronic disease.

My response variable is students' expectation that they will attend college, which was reported using a scale from 1-5, where 1 represents low college expectations.

I'm looking at 2 sets of explanatory variables. The first is maternal college expectations, which was measured through the respondents' answers to the question: "How disappointed would your mother be if you did not attend college?" Again, this was measured on a 5-point scale.

The second group of explanatory variables has to do with maternal engagement in at-home school involvement strategies (talking about grades, talking about school other, and/or helping with a school project). Each of these was coded dichotomously, with 1 indicating that the resident mother had engaged in the strategy in the last four weeks and 2 indicating that she had not.

I coded out all respondents who had a legitimate skip for questions regarding their resident mother (meaning they did not have a resident mother) as well as those who responded "I don't know" to questions regarding the resident mother's behavior. I also coded out all respondents who were not enrolled in school, who attended a school that did not have a 7-12 grade system, or who responded "I don't know" to a question asking about their grade, limiting the sample to students who were currently attending school.

In a previous assignment, I used ANOVA to test the relationship between ethnicity (categorical) and students' mean college expectations (quantitative) and found a significant relationship. Now, I want to see if maternal expectations (categorical) is a moderating factor. I ran a separate test for each of the 5 levels of maternal expectations.

I found that for categories 2 and 3 (indicating the second and third lowest levels of maternal disappointment if the student did not attend college), the p-value was not significant, suggesting that there was no relationship between ethnicity and student expectations. However, for levels 1, 4, and 5 the p-values were significant.

The graphs below show the mean student expectation, by maternal expectations, for each ethnicity (1=biracial, 2=Hispanic, 3=white, 4=black, 5=Native American, 6=Asian). We can see that the data tend to follow a similar pattern across maternal expectation levels, with Asian students having the highest mean in each category. Interestingly, however, at the lowest level of maternal expectations, white students have a relatively lower mean compared to black and biracial students, whereas they have the second highest mean expectation when maternal expectations are high (4 or 5).

This data suggests that maternal expectations moderate the association between ethnicity and college expectations.

I'm also interested in the relationship between maternal at-home involvement strategies and student expectations. It occurred to me that if a mother would not be disappointed if her child does not attend college, her engagement in at-home involvement strategies may actually have a negative effect on students' college expectations.

To test this, I ran an ANOVA test for each of the at-home involvement strategies (talk about grades, work on a project, talk about school other) with maternal expectations as a moderator.

I did not find that at-home maternal involvement had a negative effect on student expectations at the lowest level of maternal expectations. In fact, the p-values for levels 1 and 2 for 2 strategies (mom talked grades and mom talked school other) were not significant. This could suggest that engaging in these at-home involvement strategies is not correlated with an increase in student college expectations when maternal expectations are low.

On the other hand, the p-values for levels 1 and 2 were significant for the last at-home involvement strategy, helping with a school project. For both levels, it corresponded to a higher mean college expectation (3.1 and 3.5 for level 1, 3.4 and 3.9 for level 2). This might suggest that working on a school project is correlated with higher college expectations for students, even when maternal college expectations are low. However, other potential contributing factors need to be considered before causation can be declared.

Since most of my variables are categorical, I chose to evaluate two categorical variables with 3+ ordered categories, maternal expectations and student college expectations, treating both as quantitative:

The association between mom's disappointment and student expectations has a correlation coefficient of 0.35 with a significant p-value of .0001. There is a significant but somewhat weak correlation between the 2 variables.
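Squaring the correlation coefficient gives the proportion of variability in one variable that is linearly associated with the other. Here is a quick Python sketch using the coefficient reported above; the variable names are my own.

```python
# Correlation coefficient reported above.
r = 0.35

# r-squared: proportion of variability shared between the two variables.
r_squared = r ** 2

print(f"r-squared = {r_squared:.4f}")  # prints "r-squared = 0.1225"
```

So maternal expectations account for roughly 12% of the variability in student expectations, which is why I describe the correlation as somewhat weak despite the small p-value.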

I first ran a Chi Square test on 2 categorical variables, ethnicity and a specific maternal at-home involvement strategy: mom talked to the student about grades. However, I found that I could not reject the null hypothesis and conclude that the 2 categorical variables were related, as the p-value was equal to 0.2935.

Instead, I decided to compare 2 categorical variables related to my overall question: maternal expectations and maternal engagement in an at-home involvement strategy, again talking about grades. With a p-value of <.0001, I was able to reject the null and conclude that the 2 variables are related.

Because my explanatory variable has 5 levels, I have to run a post hoc test with a Bonferroni adjustment to understand why the null can be rejected (i.e., in what ways the rates differ across categories). First, I divided .05 by the 10 pairwise comparisons to get the adjusted significance level of .005. After running the tests, I found that the rates for category 1 were significantly different from the rates for categories 3, 4, and 5. Categories 2 and 3 were significantly different from category 5. In other words, the categories could be compared with the following markers: 1 (A), 2 (A,B), 3 (B), 4 (B,C), 5 (C).

This suggests that there is a meaningful difference in maternal engagement in this at-home involvement strategy based on mom's college expectations.
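For reference, the Bonferroni adjustment I used for the post hoc comparisons can be sketched in a few lines of Python. The helper `is_significant` and the example p-values are hypothetical; only the 5 levels and the .05 starting level come from my analysis.

```python
from math import comb

levels = 5      # categories of maternal expectations
alpha = 0.05    # significance level before adjustment

# Number of pairwise comparisons among 5 categories: 5 choose 2.
n_comparisons = comb(levels, 2)

# Bonferroni adjustment: divide alpha by the number of comparisons.
adjusted_alpha = alpha / n_comparisons

def is_significant(p_value):
    """True if a pairwise chi-square p-value passes the adjusted threshold."""
    return p_value < adjusted_alpha

print(n_comparisons)          # 10
print(is_significant(0.003))  # hypothetical p-value: True
print(is_significant(0.02))   # hypothetical p-value: False
```

A pairwise comparison with p=.02 would look significant at .05 but does not pass the adjusted threshold.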
