Analysis of "STEM Workers, H-1B Visas, and Productivity in US Cities"

19. Effect of Problem with Study's Values for Labor Force

As mentioned in the prior section, the study appears to be using total population for the labor force. This became apparent when using a new version of the replication program that calculates the labor force rather than using the study's version of it. Following is the output from the program ipcPSSrm.R which uses the study's data file:

[1] "1990-2010 FOR 219 METAREAS USING STUDY'S FORMULA MINUS INSTRUMENT VARIABLES, FILE=PSS_Data_pss.csv "
[1] ""
[1] " N  INTERCEPT    SLOPE    STUDY    % DIFF     S.E.   T-STAT    P-VAL  DESCRIPTION"
[1] "--  ---------  --------  -------  -------  -------  -------  -------  -----------------------------------"
[1] " 1)   -0.0830    3.3526   6.6500   -49.58    3.083    1.088    0.277  Weekly Wage, Native STEM"
[1] " 2)    0.0946    4.4874   8.0300   -44.12    1.654    2.713    0.007  Weekly Wage, Native College-Educated"
[1] " 3)    0.1504    0.9759   3.7800   -74.18    0.970    1.006    0.315  Weekly Wage, Native Non-College-Educated"
[1] " 4)    0.0055    0.8456   0.5300    59.54    0.130    6.527    0.000  Employment, Native STEM"
[1] " 5)    0.0545    4.7156   2.4800    90.14    0.632    7.466    0.000  Employment, Native College-Educated"
[1] " 6)    0.0958    1.5815  -5.1700  -130.59    1.560    1.014    0.311  Employment, Native Non-College-Educated"

As can be seen, this is identical to the third table in Section 17, which was generated using tab5_pss.R. However, following is the output from ipcREPrm.R, which uses the file PSS_Data_rep.csv. This file contains replicated data for the six key dependent variables and the one key independent variable but uses the study's labor force:

[1] "1990-2010 FOR 219 METAREAS USING STUDY'S FORMULA MINUS INSTRUMENT VARIABLES, FILE=PSS_Data_rep.csv "
[1] ""
[1] " N  INTERCEPT    SLOPE    STUDY    % DIFF     S.E.   T-STAT    P-VAL  DESCRIPTION"
[1] "--  ---------  --------  -------  -------  -------  -------  -------  -----------------------------------"
[1] " 1)   -0.0752    1.8495   6.6500   -72.19    3.142    0.589    0.556  Weekly Wage, Native STEM"
[1] " 2)    0.0950    3.8823   8.0300   -51.65    1.684    2.305    0.022  Weekly Wage, Native College-Educated"
[1] " 3)    0.1518    0.7883   3.7800   -79.15    0.973    0.811    0.418  Weekly Wage, Native Non-College-Educated"
[1] " 4)    0.0062    0.8036   0.5300    51.61    0.136    5.922    0.000  Employment, Native STEM"
[1] " 5)    0.0569    4.6493   2.4800    87.47    0.668    6.955    0.000  Employment, Native College-Educated"
[1] " 6)    0.1018    1.1463  -5.1700  -122.17    1.646    0.697    0.486  Employment, Native Non-College-Educated"

This is identical to the output generated using tab5_rep.R. The difference between this and the prior table is due to the relatively small differences between the study's data and the replicated data. As can be seen, the slopes are all still positive and the second, fourth, and fifth p-values are still significant.
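
The % DIFF and T-STAT columns in the tables above appear to follow from the SLOPE, STUDY, and S.E. columns. The following is a minimal sketch, assuming that % DIFF is the slope's percentage difference from the study's coefficient, divided by the study's signed value, and that T-STAT is the slope divided by its standard error. These formulas are inferred from the printed numbers rather than taken from the programs' source, and the function names pct_diff and t_stat are hypothetical:

```r
# Assumed formula: percentage difference of the replicated slope from the
# study's coefficient, relative to the study's (signed) value.
pct_diff <- function(slope, study) 100 * (slope - study) / study

# Assumed formula: t-statistic as the slope divided by its standard error.
t_stat <- function(slope, se) slope / se

# Row 1 of the first table (Weekly Wage, Native STEM)
round(pct_diff(3.3526, 6.65), 2)    # -49.58

# Row 6 of the first table (Employment, Native Non-College-Educated)
round(pct_diff(1.5815, -5.17), 2)   # -130.59

# About 1.087 from the rounded inputs; the table shows 1.088, presumably
# because the program works from unrounded values.
t_stat(3.3526, 3.083)
```

Because the study's coefficient for the sixth row is negative, dividing by its signed value yields a negative % DIFF even though the replicated slope is larger than the study's.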

The following table is output by ipc90_10rm_2dp.R and uses the file IP_Metro_rep9010_2dp.csv:

[1] "1990-2010 FOR 219 METAREAS USING STUDY'S FORMULA MINUS INSTRUMENT VARIABLES, FILE=IP_Metro_rep9010_2dp.csv "
[1] ""
[1] " N  INTERCEPT    SLOPE    STUDY    % DIFF     S.E.   T-STAT    P-VAL  DESCRIPTION"
[1] "--  ---------  --------  -------  -------  -------  -------  -------  -----------------------------------"
[1] " 1)   -0.0689    0.9983   6.6500   -84.99    2.904    0.344    0.731  Weekly Wage, Native STEM"
[1] " 2)    0.0992    3.2297   8.0300   -59.78    1.499    2.154    0.032  Weekly Wage, Native College-Educated"
[1] " 3)    0.1521    0.7893   3.7800   -79.12    0.892    0.885    0.377  Weekly Wage, Native Non-College-Educated"
[1] " 4)    0.0066    0.8339   0.5300    57.35    0.135    6.164    0.000  Employment, Native STEM"
[1] " 5)    0.0611    4.6283   2.4800    86.63    0.674    6.868    0.000  Employment, Native College-Educated"
[1] " 6)    0.1076    1.0957  -5.1700  -121.19    1.644    0.666    0.506  Employment, Native Non-College-Educated"

The file was created using ipd90_10rm_2dp.R and uses labor force figures extracted from IPUMS. As can be seen, using the proper values for labor force causes 4 of the 6 slopes to be smaller than in the prior table, but by a relatively small amount. In addition, the 2nd, 4th, and 5th p-values are still significant. The _2dp at the end of the file and program names indicates that the program uses 2 decimal places for the deflators, the same as the study. The following table is output by ipc90_10rm.R and uses the file IP_Metro_rep9010.csv:

[1] "1990-2010 FOR 219 METAREAS USING STUDY'S FORMULA MINUS INSTRUMENT VARIABLES, FILE=IP_Metro_rep9010.csv "
[1] ""
[1] " N  INTERCEPT    SLOPE    STUDY    % DIFF     S.E.   T-STAT    P-VAL  DESCRIPTION"
[1] "--  ---------  --------  -------  -------  -------  -------  -------  -----------------------------------"
[1] " 1)   -0.0713    0.9933   6.6500   -85.06    2.905    0.342    0.733  Weekly Wage, Native STEM"
[1] " 2)    0.0970    3.2222   8.0300   -59.87    1.500    2.149    0.032  Weekly Wage, Native College-Educated"
[1] " 3)    0.1502    0.7877   3.7800   -79.16    0.892    0.883    0.378  Weekly Wage, Native Non-College-Educated"
[1] " 4)    0.0066    0.8339   0.5300    57.35    0.135    6.164    0.000  Employment, Native STEM"
[1] " 5)    0.0611    4.6283   2.4800    86.63    0.674    6.868    0.000  Employment, Native College-Educated"
[1] " 6)    0.1076    1.0957  -5.1700  -121.19    1.644    0.666    0.506  Employment, Native Non-College-Educated"

The file was created using ipd90_10rm.R and again uses labor force figures extracted from IPUMS. The only difference from the prior table is that the program uses 8 decimal places for the deflators, giving them more accuracy. As can be seen, this has an extremely minor effect on the wages.

20. Updating the Study Through 2013

As mentioned previously, one benefit of replicating the data from its original source is that it allows alternate data or an expanded range of data to be analyzed using the same or modified methods. However, a review of the IPUMS data available turned up one obstacle. This obstacle is summarized by the following table:

KEY GEOGRAPHICAL VARIABLES AVAILABLE IN IPUMS SINCE 1980

Variable  Label                ACS                                                     3% ACS  5%  5%  5%
--------  -------------------  ------------------------------------------------------  ------------------
    YEAR  Year                 13  12  11  10  09  08  07  06  05  04  03  02  01  00  10  05  00  90  80
STATEFIP  State (FIPS code)     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
METAREA   Metropolitan area     .   .   X   X   X   X   X   X   X   .   X   .   .   .   X   X   X   X   X
MET2013   Metropolitan area,    X   X   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .
      2013 OMB delineations

This table is derived from data at this link and other data on the IPUMS web site. It shows that there is ACS (American Community Survey) data for every year from 2000 to 2013 but that METAREA data is not available for all of those years. One solution is to look at STATEFIP instead of METAREA. The R program ip1data.R is designed to look at any properly formatted data extracted from IPUMS. It can be called by another R program which specifies the files and years to be read and various other parameters. Among these other parameters is one that specifies whether STATEFIP or METAREA is to be used for designating the geographic area. For example, following are the contents of ipd00_13rs.R:

IP_Data_txt <- "ACS_State_rep0013.txt"
IP_Data_csv <- "ACS_State_rep0013.csv"
years <- c(2000, -2013)
file_prefix <- "acs_"
file_suffix <- ".dta"
useStates <- TRUE
useIncome <- "rep"
source("ip1data.R")

The first two lines give the names of the txt and csv files to which the resulting data is to be written. The variable years lists the years for which input data files are to be read. The -2013 indicates that there are files for all of the years from 2000 to 2013. If years had been set to c(2000, 2013), then the program would have only attempted to read files for 2000 and 2013, not the years in between. In any case, the years are expected to be contained within the file names, preceded by file_prefix and followed by file_suffix. The variable useStates indicates that the data is to be aggregated by states instead of metareas. The variable useIncome can be "rep", "pos", or "all" and indicates whether the data should include the groups used in the study (rep), just employed workers making positive incomes (pos), or all employed workers, even those with incomes of zero (all).
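
The convention for the years variable described above can be sketched as follows. This is only an illustration of the described behavior, not the code in ip1data.R. The function name expand_years is hypothetical, and the assumption that each file name contains the full four-digit year is mine:

```r
# Hypothetical helper: a negative entry means "every year after the previous
# entry, up to and including abs(entry)"; positive entries are kept as-is.
expand_years <- function(years) {
  out <- integer(0)
  for (y in years) {
    if (y < 0) {
      stopifnot(length(out) > 0)          # a range needs a starting year
      start <- out[length(out)] + 1
      end <- abs(y)
      if (end >= start) out <- c(out, start:end)
    } else {
      out <- c(out, y)
    }
  }
  as.integer(out)
}

file_prefix <- "acs_"
file_suffix <- ".dta"

expand_years(c(2000, -2013))   # 2000, 2001, ..., 2013
expand_years(c(2000, 2013))    # 2000 and 2013 only

# Assumed file naming: prefix, four-digit year, suffix
input_files <- paste0(file_prefix, expand_years(c(2000, -2013)), file_suffix)
input_files[1]                 # "acs_2000.dta"
```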

The results of applying the study's methods to this data can be obtained with the R program ipc00_13rs.R. That program calls the R program ip1calc.R and contains the following statements:

IP_Data_file <- "ACS_State_rep.csv"
years <- c(2000, -2013)
useStates <- TRUE
useMets219 <- FALSE
source("ip1calc.R")

The parameters are similar to those for ipd00_13rs.R with the addition of useMets219. This only has an effect if using metareas and indicates that only the 219 metareas used in the study should be used. Following is the output of the program:

[1] "2000-2013 FOR STATES USING STUDY'S FORMULA MINUS INSTRUMENT VARIABLES, FILE=ACS_State_rep0013.csv "
[1] ""
[1] " N  INTERCEPT    SLOPE    STUDY    % DIFF     S.E.   T-STAT    P-VAL  DESCRIPTION"
[1] "--  ---------  --------  -------  -------  -------  -------  -------  -----------------------------------"
[1] " 1)    0.0193   -2.7436   6.6500  -141.26    4.650   -0.590    0.555  Weekly Wage, Native STEM"
[1] " 2)   -0.0137   -1.9274   8.0300  -124.00    1.724   -1.118    0.264  Weekly Wage, Native College-Educated"
[1] " 3)    0.0045    1.7845   3.7800   -52.79    1.253    1.424    0.155  Weekly Wage, Native Non-College-Educated"
[1] " 4)    0.0032    0.1846   0.5300   -65.16    0.130    1.416    0.157  Employment, Native STEM"
[1] " 5)    0.0036   -0.1354   2.4800  -105.46    0.327   -0.414    0.679  Employment, Native College-Educated"
[1] " 6)    0.0018   -0.3846  -5.1700   -92.56    0.461   -0.833    0.405  Employment, Native Non-College-Educated"

These results differ a good deal from those of the study, at least for wages. As can be seen, the output shows a negative association between foreign workers and wages for native STEM and college-educated workers. Employment is somewhat more similar to the study's results, with a positive association for STEM employment and a negative association for non-college-educated employment. However, the slope for college-educated employment is slightly negative (-0.1354), whereas the study's is positive. Unlike the study, none of the p-values are significant.

21. Looking at Metareas with More Granularity

One of the problems with the study is that each metarea has only 3 data points (1990-2000, 2000-2005, and 2005-2010). This is especially a problem since metarea is used as a dummy variable, meaning that each metarea has its own coefficient. Hence, if one wishes to use metareas like the study, it would be useful to increase the number of data points. As can be seen from the table of geographical variables above, the longest recent period of continuous metareas is 2005-2011. Although this is a shorter span than the study, it provides twice as many data points per metarea.
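
The effect of the number of data points per metarea can be illustrated by counting residual degrees of freedom. The following is a rough sketch, assuming a regression with one dummy per metarea plus a single slope and ignoring any period effects, weights, or other controls that the study's formula may include. The function name resid_df and the toy data are hypothetical:

```r
# Residual degrees of freedom when every area has its own dummy (intercept)
# and there are n_other additional regressors.
resid_df <- function(n_areas, n_periods, n_other = 1) {
  n_obs <- n_areas * n_periods
  n_coef <- n_areas + n_other
  n_obs - n_coef
}

resid_df(219, 3)   # 657 observations - 220 coefficients = 437
resid_df(219, 6)   # 1314 observations - 220 coefficients = 1094

# Check the count against lm() on simulated toy data (5 areas, 3 periods)
set.seed(1)
toy <- expand.grid(area = 1:5, period = 1:3)
toy$x <- rnorm(nrow(toy))
toy$y <- rnorm(nrow(toy))
fit <- lm(y ~ x + factor(area), data = toy)
df.residual(fit)   # 15 observations - 6 coefficients = 9
```

Under these assumptions, doubling the data points per metarea more than doubles the residual degrees of freedom, since the number of metarea coefficients stays fixed.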

The R program ipd05_11rm.R extracts the data for this period and ipc05_11rm.R calculates the regression results. Following are those results:

[1] "2005-2011 FOR 219 METAREAS USING STUDY'S FORMULA MINUS INSTRUMENT VARIABLES, FILE=ACS_Metro_rep0511.csv "
[1] ""
[1] " N  INTERCEPT    SLOPE    STUDY    % DIFF     S.E.   T-STAT    P-VAL  DESCRIPTION"
[1] "--  ---------  --------  -------  -------  -------  -------  -------  -----------------------------------"
[1] " 1)   -0.3092   -1.2540   6.6500  -118.86    1.507   -0.832    0.406  Weekly Wage, Native STEM"
[1] " 2)   -0.0893   -0.2769   8.0300  -103.45    0.770   -0.360    0.719  Weekly Wage, Native College-Educated"
[1] " 3)   -0.0178   -0.5854   3.7800  -115.49    0.542   -1.079    0.281  Weekly Wage, Native Non-College-Educated"
[1] " 4)   -0.0187   -0.1479   0.5300  -127.90    0.081   -1.822    0.069  Employment, Native STEM"
[1] " 5)    0.0294   -0.1925   2.4800  -107.76    0.179   -1.075    0.283  Employment, Native College-Educated"
[1] " 6)   -0.0454    0.0447  -5.1700  -100.87    0.263    0.170    0.865  Employment, Native Non-College-Educated"

As before, these results differ a good deal from those of the study. As can be seen, the output shows a negative association between foreign workers and the wages of all native workers. Interestingly, the signs of the slopes for employment are all the opposite of what they are in the study. Hence, the results show a negative association between foreign workers and the employment of native STEM and college-educated workers. The p-value for the former item, native STEM employment, is 0.069, which is significant at the 10 percent level but not at the 5 percent level.

22. Summary of Applying the Study's Methods to Updated and/or Expanded Data Sets

The results from the prior two sections suggest that the study's results are heavily influenced by the years being studied and perhaps the type of geographical identifier (metarea or state) being used. This is similar to the result of an analysis of a Madeline Zavodny study which can be seen at this link. Hence, it would appear to be a mistake to study one time period and assume that its results apply to all time periods. It would seem prudent to apply the same method to numerous time periods to see if the results are consistent over time.
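
Applying the same method to several time periods could be organized along the following lines. This is a minimal sketch on simulated data, not the study's formula: the column names (area, year, x, y), the function name slopes_by_window, and the simple regression with area dummies are all assumptions made for illustration:

```r
# Fit the same regression on each window of years and collect the slope on x.
slopes_by_window <- function(df, windows) {
  rows <- lapply(windows, function(w) {
    sub <- df[df$year >= w[1] & df$year <= w[2], ]
    fit <- lm(y ~ x + factor(area), data = sub)
    est <- summary(fit)$coefficients["x", ]
    data.frame(start = w[1], end = w[2],
               slope = est[["Estimate"]],
               p_value = est[["Pr(>|t|)"]])
  })
  do.call(rbind, rows)
}

# Simulated stand-in data: 20 areas observed in each year from 2001 to 2012
set.seed(1)
demo <- expand.grid(area = 1:20, year = 2001:2012)
demo$x <- rnorm(nrow(demo))
demo$y <- 0.5 * demo$x + rnorm(nrow(demo))

res <- slopes_by_window(demo, list(c(2001, 2006), c(2007, 2012), c(2001, 2012)))
print(res)
```

Comparing the slope and p-value across the rows of the result shows at a glance whether an estimate is stable across periods or specific to one of them.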

It would also be useful to replicate the Bartik instrument variables and the instrument variables for H-1B imputed growth so that they could likewise be applied to updated and expanded data sets. In fact, this points to a problem with the approach that many sources take to replication. The replication data for this study and the Zavodny study both started with a data file that had come from a public source but with no code showing how it was extracted and/or calculated. In the case of this study, that data file contained the final figures for the Bartik and H-1B imputed growth instruments. The researcher is left to either accept these figures or to replicate them based strictly on whatever description of them might exist in the text of the study or any appendices. If the study's data file had simply been accepted for the other variables in this case, the use of total population for the labor force and the difference in the population used between wages and employment would not have been detected. More importantly perhaps, it would have been impossible to take the important step of applying the study's methods to other time periods or to check the result of other minor changes. Hence, it would seem that replication information should include all of the code required to go from the publicly available data to the study's final results. It should not start with a data file that was reportedly created from that public data.

In addition, it would be helpful if a version of the replication code written in R could be provided. Being a free software environment, R is available to a much wider audience. The replication code for this study and the Zavodny study was in Stata, which is understandable since Stata is one of the primary environments used by the academic community. Still, the result is that such studies receive much less scrutiny. This could be addressed by finding a way to make the replication code available in R.

Source Code for R Programs and Data Files Used on this Page
