A Look at Labor Condition Application (LCA) Data from 2015 through 2018

The Wikipedia page for Labor Condition Application states the following:

The Labor Condition Application (LCA) is an application filed by prospective employers on behalf of workers applying for work authorization for the non-immigrant statuses H-1B, H-1B1 (a variant of H-1B for people from Singapore and Chile) and E-3 (a variant of H-1B for workers from Australia). The application is submitted to and needs to be approved by the United States Department of Labor Employment and Training Administration (DOLETA)'s Office of Foreign Labor Certification (OFLC). The form used to submit the application is ETA Form 9035.

Load the Labor Condition Application (LCA) Data

Links to disclosure data from these applications can be found by going to the U.S. Department of Labor website and clicking on the Disclosure Data tab. These include links to disclosure data for fiscal years 2015 through 2018 at H-1B_FY2015.xlsx, H-1B_FY2016.xlsx, H-1B_FY2017.xlsx, and H-1B_FY2018.xlsx. The actual names of the files at these links are H-1B_Disclosure_Data_FY15_Q4.xlsx, H-1B_Disclosure_Data_FY16.xlsx, H-1B_Disclosure_Data_FY17.xlsx, and H-1B_FY2018.xlsx. These files need to be downloaded to the same directory as this Python notebook in order for them to be read by the following code. It can take several minutes to load the XLSX files the first time so the code writes the needed subsets of the data out to CSV files which load quickly in subsequent runs.

In [4]:
import pandas as pd
import os.path
import time
pd.set_option('display.width', 120)
#pd.set_option('max_rows', 200)

def load_lca(year):
    fields = ['CASE_STATUS','VISA_CLASS','EMPLOYER_NAME','EMPLOYER_CITY','EMPLOYER_STATE',
              'EMPLOYER_POSTAL_CODE','SOC_NAME','TOTAL_WORKERS','FULL_TIME_POSITION','PW_WAGE_LEVEL',
              'WORKSITE_CITY','WORKSITE_COUNTY','WORKSITE_STATE','WORKSITE_POSTAL_CODE']
    if year == 2015:
        xlsx_file = 'H-1B_Disclosure_Data_FY15_Q4.xlsx'
    elif year == 2016:
        xlsx_file = 'H-1B_Disclosure_Data_FY16.xlsx'
        fields.remove('PW_WAGE_LEVEL') # avoid warning for missing label
    elif year == 2017:
        xlsx_file = 'H-1B_Disclosure_Data_FY17.xlsx'
    elif year == 2018:
        xlsx_file = 'H-1B_FY2018.xlsx'
    else:
        print("ERROR: Invalid year " + str(year))
        return None
    yr = year % 100
    csv_file = 'H-1B_FY'+str(yr)+'.csv'
    if (os.path.isfile(csv_file)):
        start = time.time()
        yy = pd.read_csv(csv_file)
        end = time.time()
        print('Seconds to load: ' + str(end - start))
    else:
        print("Loading " + xlsx_file)
        start = time.time()
        zz = pd.read_excel(xlsx_file)
        end = time.time()
        print('Seconds to load: ' + str(end - start))
        if year == 2015:
            zz['TOTAL_WORKERS'] = zz['TOTAL WORKERS']
            #zz.rename(columns={'TOTAL WORKERS':'TOTAL_WORKERS'})
        yy = zz.loc[:, fields]
        if year == 2016:
            yy['PW_WAGE_LEVEL'] = None # avoid warning for missing label
        yy.to_csv(csv_file)
    return yy

dd15 = load_lca(2015)
dd16 = load_lca(2016)
dd17 = load_lca(2017)
dd18 = load_lca(2018)
Loading H-1B_Disclosure_Data_FY15_Q4.xlsx
Seconds to load: 157.14320731163025
Loading H-1B_Disclosure_Data_FY16.xlsx
Seconds to load: 168.7158179283142
Loading H-1B_Disclosure_Data_FY17.xlsx
Seconds to load: 200.3950138092041
Loading H-1B_FY2018.xlsx
Seconds to load: 135.22230863571167

As can be seen in the output above, it takes the code several minutes to read each of the XLSX file. The code therefore writes the needed subset of data out to CSV files and attempts to read these first in future executions. Surprisingly, it takes only a couple of seconds to read each of the CSV files.

Filtering and Grouping the Data

The following two python functions can be used to filter and group the loaded LCA data. Specifically, function group_lca will filter and group the data loaded from one year's LCA file and function group_lca1518 will call group_lca to filter and group the data for each of years 2015 through 2018 and join the data into a single dataframe. The calls to load_lca following the functions load the data for the four years and the call to group_lca1518 groups the data by CASE_STATUS. The None in the last two arguments specifies that there is no filtering for this call.

In [107]:
def group_lca(gg, label, group, filter, value, inflag=False, numeric=False):
    if filter != None:
        for i in range(len(filter)):
            if inflag:
                gg = gg[gg[filter[i]].str.contains(value[i]) == True]
            else:
                gg = gg[gg[filter[i]] == value[i]]
    gg = gg.groupby(group).agg({'TOTAL_WORKERS':'sum'})
    gg = gg.sort_values(by='TOTAL_WORKERS',ascending=False)
    gg = gg.rename(columns={'TOTAL_WORKERS':label})
    if numeric == False:
        gg[label] = gg.apply(lambda x: "{:,.0f}".format(x[label]), axis=1)
    return gg

def group_lca1518(nn, group, filter, value, inflag=False, numeric=False):
    gg   = group_lca(dd18, '2018', group, filter, value, inflag, numeric)
    gg17 = group_lca(dd17, '2017', group, filter, value, inflag, numeric)
    gg = gg17.join(gg, how='right')
    gg16 = group_lca(dd16, '2016', group, filter, value, inflag, numeric)
    gg = gg16.join(gg, how='right')
    gg15 = group_lca(dd15, '2015', group, filter, value, inflag, numeric)
    gg = gg15.join(gg, how='right')
    gg = gg[0:nn]
    return(gg)

dd15 = load_lca(2015)
dd16 = load_lca(2016)
dd17 = load_lca(2017)
dd18 = load_lca(2018)
ggstat = group_lca1518(10, ['CASE_STATUS'], None, None)
print(ggstat)
Seconds to load: 1.6579041481018066
Seconds to load: 1.477064609527588
Seconds to load: 1.543177604675293
Seconds to load: 1.024740219116211
                          2015       2016       2017     2018
CASE_STATUS                                                  
CERTIFIED            1,161,218  1,186,144  1,070,024  824,239
CERTIFIED-WITHDRAWN     48,854     66,382     63,780   23,827
WITHDRAWN               28,166     45,014     29,854   17,947
DENIED                  20,116     15,694     19,304   11,481

As can be seen, it only took a second or two to load each of the CSV files. The call to group_lca1518 then shows that request for over a million workers were made in each of 2015 through 2017. A bit less than a million workers have been requested so far in 2018. According to the page where the 2018 data was obtained, that file covers determinations issued between October 1, 2017 through March 31, 2018. Hence, there will an additional 6 months of determinations for 2018. In any case, the output shows that the great majority of these requests have the status of CERTIFIED.

Comparing the Prevailing Wage Levels

The "prevailing wage" is the minimum salary that an H-1B worker can be paid and is determined by guidance set forth in this document. It lists four skill levels, described briefly as follows:

  • Level I (entry) wage rates are assigned to job offers for beginning level employees who have only a basic understanding of the occupation.

  • Level II (qualified) wage rates are assigned to job offers for qualified employees who have attained, either through education or experience, a good understanding of the occupation.

  • Level III (experienced) wage rates are assigned to job offers for experienced employees who have a sound understanding of the occupation and have attained, either through education or experience, special skills or knowledge.

  • Level IV (fully competent) wage rates are assigned to job offers for competent employees who have sufficient experience in the occupation to plan and conduct work requiring judgment and the independent evaluation, selection, modification, and application of standard procedures and techniques.

A Computerworld article describes how these are used to determing the prevailing wage as follows:

Congress set four wage level tiers for visa workers. Level 1 is intended for entry-level and is at the 17th percentile. Level 3 is the occupational mean, or the 50th percentile. Level 4 is at the 66th percentile.

From the pattern, I assume that Level 2 is at the 33rd percentile. Hence, Level 3 should set the prevailing wage at the median wage with Levels 1 and 2 setting it to less and Level 4 setting it to more. The following call to group_lca1518 shows the number of workers requested in 2015 through 2018, by wage level. The last two arguments of ['CASE_STATUS'] and ['CERTIFIED'] cause the code to only count certified requests.

In [108]:
gglvl = group_lca1518(10, ['PW_WAGE_LEVEL'], ['CASE_STATUS'], ['CERTIFIED'])
print(gglvl)
                  2015  2016     2017     2018
PW_WAGE_LEVEL                                 
Level II       448,700   NaN  314,696  396,853
Level I        472,283   NaN  346,206  144,379
Level III      121,743   NaN  121,364  138,054
Level IV        61,669   NaN   63,928   68,942

As can be seen, 2015 and 2017 had the most workers requested at Level I, slightly above the number requested at Level II. Well behind those were Level III and Level IV. For 2016, the LCA disclosure data did not contain the wage level. For 2018, the current order of the requests are Level II, Level I, Level III, and Level IV. This of course could change since there will be an additional six months of determinations. However, it may also be due to an increased scrutiny of wage levels described in an article titled United States: Increased Scrutiny Over H-1B Visas: Beware Of Level 1 Wages!. It states:

The recent USCIS memo concludes a Level 1 (entry level) designation for a position, covered under the Computer Programmer occupation classification, would not qualify as a specialty occupation position. As a result, this combination (i.e., Level 1 and the Computer Programmer SOC code) would probably be one to stay away from if trying to show position is a specialty occupation. This is consistent with Senator Grassley's bill and messages to establish more of a merit based system focusing on the best and the brightest over those that are more entry level (i.e., those only deserving of a level 1 wage).

Comparing the Prevailing Wage Levels Further Grouped by Whether It's Full-time

The following code again looks at the wage levels but also includes the setting of FULL_TIME_POSITION in the grouping.

In [109]:
gglvlft = group_lca1518(10, ['PW_WAGE_LEVEL','FULL_TIME_POSITION'], ['CASE_STATUS'], ['CERTIFIED'])
print(gglvlft)
                                     2015  2016     2017     2018
PW_WAGE_LEVEL FULL_TIME_POSITION                                 
Level II      Y                   446,636   NaN  312,930  392,192
Level I       Y                   461,222   NaN  337,221  140,789
Level III     Y                   121,242   NaN  120,937  137,530
Level IV      Y                    61,099   NaN   63,642   68,526
Level II      N                     2,064   NaN    1,766    4,661
Level I       N                    11,061   NaN    8,985    3,590
Level III     N                       501   NaN      427      524
Level IV      N                       570   NaN      286      416

As can be seen, the great majority of the requests are for full-time positions and the ordering of wage levels remains pretty much the same for full-time and part-time positions.

Looking at Which Employers Request the Most H-1B Visas

The following call to group_lca1518 lists the 20 employers who requested the most H-1B workers in 2018.

In [110]:
ggemp = group_lca1518(20, ['EMPLOYER_NAME'], ['CASE_STATUS'], ['CERTIFIED'])
print(ggemp)
                                                    2015     2016    2017     2018
EMPLOYER_NAME                                                                     
ERNST & YOUNG U.S. LLP                             4,115    6,879  12,273  138,660
DELOITTE CONSULTING LLP                          118,880  122,189  93,658   31,972
COGNIZANT TECHNOLOGY SOLUTIONS U.S. CORPORATION   82,611   95,814  50,755   29,693
HCL AMERICA, INC.                                 15,762   12,668  39,668   23,792
APPLE INC.                                         8,789   23,028  27,223   19,925
QUALCOMM TECHNOLOGIES, INC.                        1,656    5,396  11,672   15,612
TATA CONSULTANCY SERVICES LIMITED                 43,231   17,097  18,974   15,563
KFORCE INC.                                        3,162    2,860   7,024   10,553
MPHASIS CORPORATION                               19,662   16,229  21,823   10,403
AMAZON.COM SERVICES, INC.                            NaN      NaN     NaN   10,201
QUALCOMM ATHEROS, INC.                               212    1,937   6,756    8,132
COGNIZANT TECHNOLOGY SOLUTIONS US CORP               NaN      NaN     NaN    8,038
NVIDIA CORPORATION                                 2,450    2,613   8,704    7,797
CAPGEMINI AMERICA INC                                NaN   43,062  16,198    7,567
SYNOPSYS, INC.                                     5,737    4,857   7,398    7,017
CISCO SYSTEMS, INC.                                6,184    6,473   7,583    6,619
INFOSYS LIMITED                                   33,216   25,322  20,565    6,548
WIPRO LIMITED                                     33,048   30,529   9,815    5,450
MICROSOFT CORPORATION                              5,689    9,589   7,136    5,174
ACCENTURE LLP                                     17,509   12,305   7,548    5,170

As can be seen, the employers 'AMAZON.COM SERVICES, INC.' and 'COGNIZANT TECHNOLOGY SOLUTIONS US CORP' have NaN for one or years prior to 2018. The latter employer appears to simply be another name for 'COGNIZANT TECHNOLOGY SOLUTIONS U.S. CORPORATION', the fourth employer in the list. The list does not contain another name for 'AMAZON.COM SERVICES, INC.' but those can be searched for via the following command:

In [111]:
ggamaz = group_lca1518(20, ['EMPLOYER_NAME'], ['EMPLOYER_NAME','CASE_STATUS'], ['AMAZON','CERTIFIED$'], True)
print(ggamaz)
                                    2015   2016    2017    2018
EMPLOYER_NAME                                                  
AMAZON.COM SERVICES, INC.            NaN    NaN     NaN  10,201
AMAZON FULFILLMENT SERVICES, INC.     33     10     123   1,228
AMAZON WEB SERVICES, INC.            439    899   1,253   1,186
AMAZON CORPORATE LLC               6,928  6,394  10,171     139
AMAZON ROBOTICS LLC                    3     78     137      99
AMAZON.COM SERVICES INC              NaN    NaN     NaN      85
AMAZON.COM SERVICES, INC             NaN    NaN     NaN      68
AMAZON MECHANICAL TURK, INC.           2    NaN      11      26
AMAZON FULFILLMENT SERVICES INC      NaN    NaN       2      25
AMAZON WEB SERVICES INC.               5      1       5      22
AMAZON.COM.DEDC, LLC                 363    102     117      21
AMAZON MEDIA GROUP LLC               NaN      4      15      17
AMAZON FULFILLMENT SERVICES, INC     NaN    NaN     NaN      15
AMAZON WEB SERVICES INC               51    NaN      21      15
AMAZON SERVICES LLC                   13      6      13       9
AMAZON REGISTRY SERVICES INC.        NaN    NaN       2       7
AMAZON.COM SERVICES INC.             NaN    NaN     NaN       7
AMAZON FRESH LLC                     NaN    NaN       6       7
AMAZON.COM.DEDC LLC                    6      1       6       6
AMAZON MEDIA GROUP                   NaN      2       5       6

Most of these appear to be alternate names for AMAZON. This makes clear the fact that there are no apparent rules for spell-checking or limiting the names for a company. Hence, the list of the top 20 requestors of H-1B visas should be taken to be an estimate, at least until further cleaning of the data can be done. In any event, another striking thing about the the top 20 requestors is the huge increase in the number of H-1B workers requested by 'ERNST & YOUNG U.S. LLP'. The following code searches for other spellings:

In [112]:
ggey = group_lca1518(20, ['EMPLOYER_NAME'], ['EMPLOYER_NAME','CASE_STATUS'], ['ERNST ','CERTIFIED$'], True)
print(ggey)
ggey = group_lca(dd15, '2015', ['EMPLOYER_NAME'], ['EMPLOYER_NAME','CASE_STATUS'], ['ERNST ','CERTIFIED$'], True)
print(ggey)
ggey = group_lca(dd16, '2016', ['EMPLOYER_NAME'], ['EMPLOYER_NAME','CASE_STATUS'], ['ERNST ','CERTIFIED$'], True)
print(ggey)
ggey = group_lca(dd17, '2017', ['EMPLOYER_NAME'], ['EMPLOYER_NAME','CASE_STATUS'], ['ERNST ','CERTIFIED$'], True)
print(ggey)
                                2015   2016    2017     2018
EMPLOYER_NAME                                               
ERNST & YOUNG U.S. LLP         4,115  6,879  12,273  138,660
ERNST & YOUNG U.S. LLP         8    810     125    2,106
ERNST & YOUNG LLP (CNMI) INC.    NaN    NaN     NaN        1
                                        2015
EMPLOYER_NAME                               
ERNST & YOUNG U.S. LLP                 4,115
ERNST & YOUNG U.S. LLP                 8
ERNST & YOUNG LLP (EY GUAM)                7
ERNST &  YOUNG U.S. LLP                    5
ERNST &YOUNG U.S. LLP                      4
ERNST & YOUNG LLP (EY CNMI)                3
ERNST & YOUNG (EY CNMI)                    1
ROTHWELL, FIGG, ERNST & MANBECK, P.C.      1
                                2016
EMPLOYER_NAME                       
ERNST & YOUNG U.S. LLP         6,879
ERNST & YOUNG U.S. LLP       810
ERNST &YOUNG U.S. LLP             69
ERNST &  YOUNG U.S. LLP           33
ERNST & YOUNG LLP                 22
ERNST AND YOUNG U.S. LLP          22
ERNST & YOUNG LLP (EY GUAM)        5
ERNST &; YOUNG U.S. LLP            2
ERNST &AMP YOUNG U.S. LLP          2
ERNST AND YOUNG LLP (EY GUAM)      2
ERNST  & YOUNG U.S. LLP            1
ERNST & YOUNG U.S LLP              1
ERNST &  YOUNG U.S. LLP        1
                                              2017
EMPLOYER_NAME                                     
ERNST & YOUNG U.S. LLP                      12,273
ERNST & YOUNG U.S. LLP                     125
ERNST & YOUNG LLP (EY GUAM)                     38
ERNST &  YOUNG U.S. LLP                         14
ERNST & YOUNG INFRASTRUCTURE ADVISORS, LLC       3
ERNST & YOUNG; YOUNG U.S. LLP                    2
ERNST  YOUNG U.S. LLP                            1
ERNST & YOUNG U.S LLP                            1
ROTHWELL, FIGG, ERNST & MANBECK, P.C.        1

As can be seen, there's many alternate names but none requesting a large number of workers. The following plot shows the change in H-1B workers requested for the top 6 requestors in 2018. This ignores any of the alternate names above or for any of the other top 6 requestors:

In [113]:
import pandas as pd
import matplotlib.pyplot as plt

gg6 = group_lca1518(6, ['EMPLOYER_NAME'], ['CASE_STATUS'], ['CERTIFIED'], False, True)
gg6T = gg6.T
gg6T.columns = ['ERNST & YOUNG','DELOITTE','COGNIZANT','HCL AMERICA','APPLE','QUALCOMM']
print(gg6T)
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
ax.plot(gg6T)
ax.set_title('CERTIFIED H-1B WORKERS REQUESTED FOR TOP 6 REQUESTORS IN 2018')
ax.set_xlabel('Year')
ax.set_ylabel('Certified H-1B Workers Requested')
ax.grid(zorder=0)
ax.legend(gg6T.columns)
      ERNST & YOUNG  DELOITTE  COGNIZANT  HCL AMERICA    APPLE  QUALCOMM
2015         4115.0  118880.0    82611.0      15762.0   8789.0    1656.0
2016         6879.0  122189.0    95814.0      12668.0  23028.0    5396.0
2017        12273.0   93658.0    50755.0      39668.0  27223.0   11672.0
2018       138660.0   31972.0    29693.0      23792.0  19925.0   15612.0
Out[113]:
<matplotlib.legend.Legend at 0x1e78449fc50>

The plot shows the huge increase in H-1B workers requested by 'ERNST & YOUNG U.S. LLP' in 2018. It does show a decrease in workers requested by 'DELOITTE CONSULTING LLP', another of the "Big Four accounting firms". However, the decrease in 2018 may be due in part to the fact that the data only includes the first half of fiscal year 2018. In any event, the reason for the large increase in requests for Ernst & Young is unclear. In order to get a more detailed view of the increase, the following code shows the values of SOC_NAME for the requested workers. SOC_NAME is the Occupational Name of the job as classified by the Standard Occupational Classification (SOC) system.

In [114]:
ggsoc = group_lca1518(20, ['SOC_NAME'],['CASE_STATUS','EMPLOYER_NAME'],['CERTIFIED','ERNST & YOUNG U.S. LLP'])
print(ggsoc)
                                                 2015   2016   2017    2018
SOC_NAME                                                                   
ACCOUNTANTS AND AUDITORS                        1,533  1,847  4,760  37,242
COMPUTER SYSTEMS ANALYSTS                       1,045  1,839  2,183  21,290
FINANCIAL SPECIALISTS, ALL OTHER                   44    832    411  17,755
MANAGEMENT ANALYSTS                             1,329  2,199  4,229  17,101
COMPUTER OCCUPATIONS, ALL OTHER                   NaN      1    222  12,429
OPERATIONS RESEARCH ANALYSTS                       31     72    133  10,189
FINANCIAL ANALYSTS                                 30     19    200   7,502
ACTUARIES                                          53     39     64   3,850
INFORMATION SECURITY ANALYSTS                      18     10      9   2,794
STATISTICIANS                                       2    NaN      7   1,870
SOFTWARE DEVELOPERS, APPLICATIONS                   2      3    NaN   1,453
LOGISTICIANS                                        3      2    NaN   1,100
COMPUTER NETWORK ARCHITECTS                       NaN    NaN    NaN     600
DATABASE ADMINISTRATORS                             1      2    NaN     580
MATHEMATICIANS                                    NaN    NaN      2     550
SOFTWARE DEVELOPERS, SYSTEMS SOFTWARE             NaN    NaN    NaN     492
INDUSTRIAL ENGINEERS                              NaN    NaN    NaN     400
HUMAN RESOURCES SPECIALISTS                         5      3     25     301
CLAIMS ADJUSTERS, EXAMINERS, AND INVESTIGATORS    NaN    NaN    NaN     300
COMPUTER AND INFORMATION SYSTEMS MANAGERS           5      2      5     240

As can be seen, the additional H-1B workers requested are chiefly for accounting, computer, and financial occupations. It would seem possible that the increase may have something to do with the 2017 Tax Cuts and Jobs Act. Any connection may become more clear with the information about requests during the rest of 2018, the number of requests that are actually approved, and any reported changes in the current work force of Ernst and Young. In any event, the large increase in requests would seem to merit additional study.