19. Effect of Problem with Study's Values for Labor Force
As mentioned in the prior section, the study appears to be using total population for the labor force. This became apparent when using a new version of the replication program that calculates the labor force rather than using the study's version of it. Following is the output from the program ipcPSSrm.R which uses the study's data file:
[1] "1990-2010 FOR 219 METAREAS USING STUDY'S FORMULA MINUS INSTRUMENT VARIABLES, FILE=PSS_Data_pss.csv "
[1] ""
[1] " N INTERCEPT    SLOPE   STUDY  % DIFF    S.E.  T-STAT   P-VAL DESCRIPTION"
[1] "-- --------- -------- ------- ------- ------- ------- ------- -----------------------------------"
[1] " 1)  -0.0830   3.3526  6.6500  -49.58   3.083   1.088   0.277 Weekly Wage, Native STEM"
[1] " 2)   0.0946   4.4874  8.0300  -44.12   1.654   2.713   0.007 Weekly Wage, Native College-Educated"
[1] " 3)   0.1504   0.9759  3.7800  -74.18   0.970   1.006   0.315 Weekly Wage, Native Non-College-Educated"
[1] " 4)   0.0055   0.8456  0.5300   59.54   0.130   6.527   0.000 Employment, Native STEM"
[1] " 5)   0.0545   4.7156  2.4800   90.14   0.632   7.466   0.000 Employment, Native College-Educated"
[1] " 6)   0.0958   1.5815 -5.1700 -130.59   1.560   1.014   0.311 Employment, Native Non-College-Educated"
As can be seen, this is identical to the third table in Section 17 which was generated using tab5_pss.R. However, following is the output from ipcREPrm.R which uses the file PSS_Data.csv. This file replicates the six key dependent variables and one key independent variable data but uses the study's labor force:
[1] "1990-2010 FOR 219 METAREAS USING STUDY'S FORMULA MINUS INSTRUMENT VARIABLES, FILE=PSS_Data_rep.csv "
[1] ""
[1] " N INTERCEPT    SLOPE   STUDY  % DIFF    S.E.  T-STAT   P-VAL DESCRIPTION"
[1] "-- --------- -------- ------- ------- ------- ------- ------- -----------------------------------"
[1] " 1)  -0.0752   1.8495  6.6500  -72.19   3.142   0.589   0.556 Weekly Wage, Native STEM"
[1] " 2)   0.0950   3.8823  8.0300  -51.65   1.684   2.305   0.022 Weekly Wage, Native College-Educated"
[1] " 3)   0.1518   0.7883  3.7800  -79.15   0.973   0.811   0.418 Weekly Wage, Native Non-College-Educated"
[1] " 4)   0.0062   0.8036  0.5300   51.61   0.136   5.922   0.000 Employment, Native STEM"
[1] " 5)   0.0569   4.6493  2.4800   87.47   0.668   6.955   0.000 Employment, Native College-Educated"
[1] " 6)   0.1018   1.1463 -5.1700 -122.17   1.646   0.697   0.486 Employment, Native Non-College-Educated"
This is identical to the output generated using tab5_rep.R. The difference between this and the prior table is due to the relatively small differences between the study's data and the replicated data. As can be seen, the slopes are all still positive and the second, fourth, and fifth p-values are still significant.
The following table is output by ipc90_10rm_2dp.R and uses the file IP_Metro_rep9010_2dp.csv:
[1] "1990-2010 FOR 219 METAREAS USING STUDY'S FORMULA MINUS INSTRUMENT VARIABLES, FILE=IP_Metro_rep9010_2dp.csv "
[1] ""
[1] " N INTERCEPT    SLOPE   STUDY  % DIFF    S.E.  T-STAT   P-VAL DESCRIPTION"
[1] "-- --------- -------- ------- ------- ------- ------- ------- -----------------------------------"
[1] " 1)  -0.0689   0.9983  6.6500  -84.99   2.904   0.344   0.731 Weekly Wage, Native STEM"
[1] " 2)   0.0992   3.2297  8.0300  -59.78   1.499   2.154   0.032 Weekly Wage, Native College-Educated"
[1] " 3)   0.1521   0.7893  3.7800  -79.12   0.892   0.885   0.377 Weekly Wage, Native Non-College-Educated"
[1] " 4)   0.0066   0.8339  0.5300   57.35   0.135   6.164   0.000 Employment, Native STEM"
[1] " 5)   0.0611   4.6283  2.4800   86.63   0.674   6.868   0.000 Employment, Native College-Educated"
[1] " 6)   0.1076   1.0957 -5.1700 -121.19   1.644   0.666   0.506 Employment, Native Non-College-Educated"
The file was created using ipd90_10rm_2dp.R and uses labor force figures extracted from IPUMS. As can be seen, using the proper values for labor force causes 4 of the 6 slopes to be smaller, but by a relatively small amount. In addition, the 2nd, 4th, and 5th p-values are still significant. The _2dp at the end of the file and program names indicates that the program uses 2 decimal places for the deflators, the same as the study.
The following table is output by ipc90_10rm.R and uses the file IP_Metro_rep9010.csv:
[1] "1990-2010 FOR 219 METAREAS USING STUDY'S FORMULA MINUS INSTRUMENT VARIABLES, FILE=IP_Metro_rep9010.csv "
[1] ""
[1] " N INTERCEPT    SLOPE   STUDY  % DIFF    S.E.  T-STAT   P-VAL DESCRIPTION"
[1] "-- --------- -------- ------- ------- ------- ------- ------- -----------------------------------"
[1] " 1)  -0.0713   0.9933  6.6500  -85.06   2.905   0.342   0.733 Weekly Wage, Native STEM"
[1] " 2)   0.0970   3.2222  8.0300  -59.87   1.500   2.149   0.032 Weekly Wage, Native College-Educated"
[1] " 3)   0.1502   0.7877  3.7800  -79.16   0.892   0.883   0.378 Weekly Wage, Native Non-College-Educated"
[1] " 4)   0.0066   0.8339  0.5300   57.35   0.135   6.164   0.000 Employment, Native STEM"
[1] " 5)   0.0611   4.6283  2.4800   86.63   0.674   6.868   0.000 Employment, Native College-Educated"
[1] " 6)   0.1076   1.0957 -5.1700 -121.19   1.644   0.666   0.506 Employment, Native Non-College-Educated"
The file was created using ipd90_10rm.R and again uses labor force figures extracted from IPUMS. The only difference from the prior table is that the program uses 8 decimal places for the deflators, giving them more accuracy. As can be seen, this has an extremely minor effect on the wages.
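The % DIFF column in the tables of this section appears to be the percentage by which each replicated slope differs from the study's coefficient. The following is a minimal sketch under that assumption; the function name pct_diff is illustrative and is not taken from the replication programs. Because the printed slopes are rounded to four decimal places, the recomputed values can differ from the printed column in the last digit.

```r
# Percent difference between a replicated slope and the study's coefficient.
# Assumption (inferred from the printed tables, not from the program source):
#   % DIFF = 100 * (slope - study) / study
pct_diff <- function(slope, study) {
  100 * (slope - study) / study
}

# SLOPE and STUDY columns copied from the first table in this section
# (FILE=PSS_Data_pss.csv).
slope <- c(3.3526, 4.4874, 0.9759, 0.8456, 4.7156, 1.5815)
study <- c(6.65, 8.03, 3.78, 0.53, 2.48, -5.17)

# Compare against the printed % DIFF column.
printed <- c(-49.58, -44.12, -74.18, 59.54, 90.14, -130.59)
print(data.frame(slope, study,
                 recomputed = round(pct_diff(slope, study), 2),
                 printed))
```

Note that when the study's coefficient is negative, as for non-college-educated employment, a positive replicated slope yields a difference below -100 percent.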
20. Updating the Study Through 2013
As mentioned previously, one benefit of replicating the data from its original source is that it allows alternate data or an expanded range of data to be analyzed using the same or modified methods. However, a review of the IPUMS data available turned up one obstacle. This obstacle is summarized by the following table:
KEY GEOGRAPHICAL VARIABLES AVAILABLE IN IPUMS SINCE 1980

Variable Label                                        ACS                           3%  ACS 5%  5%  5%
-------- ------------------- ------------------------------------------------------ ------------------
YEAR     Year                13  12  11  10  09  08  07  06  05  04  03  02  01  00 10  05  00  90  80
STATEFIP State (FIPS code)    X   X   X   X   X   X   X   X   X   X   X   X   X   X  X   X   X   X   X
METAREA  Metropolitan area    .   .   X   X   X   X   X   X   X   .   X   .   .   .  X   X   X   X   X
MET2013  Metropolitan area,   X   X   .   .   .   .   .   .   .   .   .   .   .   .  .   .   .   .   .
         2013 OMB delineations

This table is derived from data at this link and other data on the IPUMS web site. It shows that there is ACS (American Community Survey) data for every year from 2000 to 2013 but that METAREA data is not available for all of those years. One solution is to look at STATEFIP instead of METAREA. The R program ip1data.R is designed to look at any properly-formatted data extracted from IPUMS. It can be called by another R program which specifies the files and years to be read and various other parameters. Among these other parameters is one to specify whether STATEFIP or METAREA is to be used for designating the geographic area. For example, following is the contents of ipd00_13rs.R:
IP_Data_txt <- "ACS_State_rep0013.txt"
IP_Data_csv <- "ACS_State_rep0013.csv"
years <- c(2000, -2013)
file_prefix <- "acs_"
file_suffix <- ".dta"
useStates <- TRUE
useIncome <- "rep"
source("ip1data.R")

The first two lines give the names of the txt and csv files to which the resulting data is to be written. The variable years lists the years for which input data files are to be read. The -2013 indicates that there are files for all of the years from 2000 to 2013. If years had been set to c(2000, 2013), then the program would have only attempted to read files for 2000 and 2013, not the years in between. In any case, the years are expected to be contained within the file names, preceded by file_prefix and followed by file_suffix. The variable useStates indicates that the data is to be aggregated by states instead of metareas. The variable useIncome can be "rep", "pos", or "all" and indicates whether the data should include the groups used in the study (rep), just employed workers making positive incomes (pos), or all employed workers, even those with incomes of zero (all).
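The convention for the years variable can be illustrated with a short sketch. The helper expand_years below is hypothetical and is not the code in ip1data.R; it only assumes the behavior described above, namely that a negative entry means "every year after the previous entry, up to and including this one", and that file names are formed as prefix, year, suffix.

```r
# Hypothetical helper (not from ip1data.R): expand a years vector in which
# a negative entry closes a range that starts at the previous entry.
expand_years <- function(years) {
  out <- numeric(0)
  for (y in years) {
    if (y < 0 && length(out) > 0) {
      last <- out[length(out)]
      # Fill in the years between the previous entry and abs(y), inclusive.
      if (abs(y) > last) out <- c(out, seq(last + 1, abs(y)))
    } else {
      out <- c(out, abs(y))
    }
  }
  as.integer(out)
}

file_prefix <- "acs_"
file_suffix <- ".dta"

# Range form: every year from 2000 through 2013.
all_years   <- expand_years(c(2000, -2013))
input_files <- paste0(file_prefix, all_years, file_suffix)
print(input_files)

# List form: only the two years given.
print(expand_years(c(2000, 2013)))
```

Under these assumptions the range form yields 14 file names, one per year, while the list form yields only the files for 2000 and 2013.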
The results of applying the study's methods to this data can be obtained with the R program ipc00_13rs.R. That program calls the R program ip1calc.R and contains the following statements:
IP_Data_file <- "ACS_State_rep.csv"
years <- c(2000, -2013)
useStates <- TRUE
useMets219 <- FALSE
source("ip1calc.R")

The parameters are similar to those for ipd00_13rs.R with the addition of useMets219. This only has an effect if using metareas and indicates that only the 219 metareas used in the study should be used. Following is the output of the program:
[1] "2000-2013 FOR STATES USING STUDY'S FORMULA MINUS INSTRUMENT VARIABLES, FILE=ACS_State_rep0013.csv "
[1] ""
[1] " N INTERCEPT    SLOPE   STUDY  % DIFF    S.E.  T-STAT   P-VAL DESCRIPTION"
[1] "-- --------- -------- ------- ------- ------- ------- ------- -----------------------------------"
[1] " 1)   0.0193  -2.7436  6.6500 -141.26   4.650  -0.590   0.555 Weekly Wage, Native STEM"
[1] " 2)  -0.0137  -1.9274  8.0300 -124.00   1.724  -1.118   0.264 Weekly Wage, Native College-Educated"
[1] " 3)   0.0045   1.7845  3.7800  -52.79   1.253   1.424   0.155 Weekly Wage, Native Non-College-Educated"
[1] " 4)   0.0032   0.1846  0.5300  -65.16   0.130   1.416   0.157 Employment, Native STEM"
[1] " 5)   0.0036  -0.1354  2.4800 -105.46   0.327  -0.414   0.679 Employment, Native College-Educated"
[1] " 6)   0.0018  -0.3846 -5.1700  -92.56   0.461  -0.833   0.405 Employment, Native Non-College-Educated"
These results differ a good deal from those of the study, at least for wages. As can be seen, the output shows a negative association between foreign workers and wages for native STEM and College-Educated workers. Employment is somewhat closer to the study's results, with a positive association for STEM employment and a negative association for Non-College-Educated employment, although the slope for College-Educated employment (-0.1354) is negative where the study's (2.48) is positive. Unlike the study, none of the p-values are significant.
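The T-STAT column in the table above is consistent with each slope divided by its standard error, and the statement about significance can be checked directly from the P-VAL column. The sketch below copies the printed values from the 2000-2013 state table; the 0.05 threshold is an assumption, since the programs' own cutoff is not shown here.

```r
# SLOPE, S.E., T-STAT and P-VAL columns copied from the 2000-2013 state table.
slope   <- c(-2.7436, -1.9274, 1.7845, 0.1846, -0.1354, -0.3846)
se      <- c(4.650, 1.724, 1.253, 0.130, 0.327, 0.461)
t_print <- c(-0.590, -1.118, 1.424, 1.416, -0.414, -0.833)
p_val   <- c(0.555, 0.264, 0.155, 0.157, 0.679, 0.405)

# t-statistic as slope over standard error; small differences from the
# printed column come from rounding in the printed inputs.
t_stat <- slope / se
print(round(cbind(recomputed = t_stat, printed = t_print), 3))

# Assumed significance threshold of 5%.
alpha <- 0.05
significant <- p_val < alpha
print(significant)
```

With these inputs no row falls below the assumed threshold, which matches the observation that none of the p-values are significant.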
21. Looking at Metareas with More Granularity
One of the problems with the study is that each metarea has only 3 data points (1990-2000, 2000-2005, and 2005-2010). This is especially a problem since metarea is used as a dummy variable, meaning that each metarea has its own coefficient. Hence, if one wishes to use metareas like the study, it would be useful to increase the number of data points. As can be seen from the table of geographical variables above, the longest recent period of continuous metareas is 2005-2011. Although this is a shorter span than the study, it provides twice as many data points per metarea.
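The cost of giving each metarea its own dummy variable can be seen in a small simulated example. The sketch below is not the study's formula; the data are randomly generated, the variable names are placeholders, and the model includes only one regressor plus the metarea dummies. It shows how few residual degrees of freedom remain with 3 observations per area and how that changes with 6.

```r
set.seed(1)

# Fit y ~ x + metarea dummies on a simulated panel and report how many
# coefficients are estimated and how many residual degrees of freedom remain.
# All names and values here are illustrative assumptions.
dummy_fit_summary <- function(n_areas, n_periods) {
  panel <- data.frame(
    metarea = factor(rep(seq_len(n_areas), each = n_periods)),
    x       = rnorm(n_areas * n_periods)  # stand-in for the key regressor
  )
  panel$y <- 0.5 * panel$x + rnorm(nrow(panel), sd = 0.1)  # stand-in outcome

  # One intercept, one common slope, and (n_areas - 1) metarea dummies.
  fit <- lm(y ~ x + metarea, data = panel)
  c(observations = nrow(panel),
    coefficients = length(coef(fit)),
    residual_df  = df.residual(fit))
}

three_points <- dummy_fit_summary(n_areas = 5, n_periods = 3)
six_points   <- dummy_fit_summary(n_areas = 5, n_periods = 6)
print(rbind(three_points, six_points))
```

In this toy setup the number of estimated coefficients is the same in both cases, so doubling the observations per area more than doubles the residual degrees of freedom available for estimating the common slope.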
The R program ipd05_11rm.R extracts the data for this period and ipc05_11rm.R calculates the regression results. Following are those results:
[1] "2005-2011 FOR 219 METAREAS USING STUDY'S FORMULA MINUS INSTRUMENT VARIABLES, FILE=ACS_Metro_rep0511.csv "
[1] ""
[1] " N INTERCEPT    SLOPE   STUDY  % DIFF    S.E.  T-STAT   P-VAL DESCRIPTION"
[1] "-- --------- -------- ------- ------- ------- ------- ------- -----------------------------------"
[1] " 1)  -0.3092  -1.2540  6.6500 -118.86   1.507  -0.832   0.406 Weekly Wage, Native STEM"
[1] " 2)  -0.0893  -0.2769  8.0300 -103.45   0.770  -0.360   0.719 Weekly Wage, Native College-Educated"
[1] " 3)  -0.0178  -0.5854  3.7800 -115.49   0.542  -1.079   0.281 Weekly Wage, Native Non-College-Educated"
[1] " 4)  -0.0187  -0.1479  0.5300 -127.90   0.081  -1.822   0.069 Employment, Native STEM"
[1] " 5)   0.0294  -0.1925  2.4800 -107.76   0.179  -1.075   0.283 Employment, Native College-Educated"
[1] " 6)  -0.0454   0.0447 -5.1700 -100.87   0.263   0.170   0.865 Employment, Native Non-College-Educated"
As before, these results differ a good deal from those of the study. As can be seen, the output shows a negative association between foreign workers and the wages of all native workers. Interestingly, the signs of the slopes for employment are all opposite of what they are for the study. Hence, the results show a negative association between foreign workers and the employment of native STEM and college-educated workers. The p-value for the former item, native STEM employment, is 0.069, which is significant at the 10% level but not at the 5% level.
22. Summary of Applying the Study's Methods to Updated and/or Expanded Data Sets
The results from the prior two sections suggest that the study's results are heavily influenced by the years being studied and perhaps the type of geographical identifier (metarea or state) being used. This is similar to the result of an analysis of a Madeline Zavodny study which can be seen at this link. Hence, it would appear to be a mistake to study one time period and assume that its results apply to all time periods. It would seem prudent to apply the same method to numerous time periods to see if the results are consistent over time.
It would be useful to also replicate the Bartik instrument variables and the instrument variables for H-1B imputed growth so that they could likewise be applied to updated and expanded data sets. In fact, this points to a problem with the approach that many sources take to replication. The replication data for this study and the Zavodny study both started with a data file that had come from a public source but with no code showing how it was extracted and/or calculated. In the case of this study, that data file contained the final figures for the Bartik and H-1B imputed growth instruments. The researcher is left to either accept these figures or to replicate them based strictly on whatever description of them might exist in the text of the study or any appendices. If that had been done in this case, the use of total population for the labor force or the difference in the population used between wages and employment would not have been detected. More importantly perhaps, it would have been impossible to take the important step of applying the study's methods to other time periods or to check the result of other minor changes. Hence, it would seem that replication information should include all of the code required to go from the publicly available data to the study's final results. It should not start with a data file that was reportedly created from that public data.
In addition, it would be helpful if a version of the replication code written in R could be provided. Being a free software environment, R is available to a much wider audience. The replication code for this and the Zavodny study was in Stata, which is understandable since Stata is one of the primary environments used by the academic community. Still, the result is that such studies receive much less scrutiny. This could be addressed by finding a way to make the replication code available in R.
Source Code for R Programs and Data Files Used on this Page