J Occup Environ Hyg. 2022 Jul;19(7):437-447. doi: 10.1080/15459624.2022.2076860. Epub 2022 Jun 7.
Recently, the National Institute for Occupational Safety and Health (NIOSH) released an updated version of the NIOSH Industry and Occupation Computerized Coding System (NIOCCS), which uses supervised machine learning to assign industry and occupation codes based on provided free-text information. However, no efforts have been made to externally verify the quality of the assigned industry and occupation codes when the algorithm is provided with inputs of varying quality. This study sought to evaluate whether the NIOCCS algorithm was sufficiently robust to low-quality inputs and how variable input quality could affect the resulting job-level exposure estimates in a large job-exposure matrix for noise (NoiseJEM). Using free-text industry and job descriptions from >700,000 noise measurements in the NoiseJEM, three files were created and input into NIOCCS: (1) N1, “raw” industries and job titles; (2) N2, “refined” industries and “raw” job titles; and (3) N3, “refined” industries and job titles. Standardized industry and occupation codes were output by NIOCCS. Descriptive statistics of performance metrics (e.g., misclassification/discordance of occupation codes) were evaluated for each input relative to the original NoiseJEM dataset (N0). Across major Standard Occupational Classification (SOC) groups, total discordance rates for N1, N2, and N3 compared to N0 were 53.6%, 42.3%, and 5.0%, respectively. The impact of discordance varied by major SOC group and included both over- and under-estimates of average noise exposure compared to N0. N2 had the most accurate noise exposure estimates (i.e., smallest bias) across major SOC groups compared to N1 and N3. Further refinement of job titles in N3 showed little improvement. Some variation in classification efficacy was seen over time, particularly prior to 1985. Machine learning algorithms can systematically and consistently classify data but are highly dependent on the quality and amount of input data.
The greatest benefit for an end-user may come from cleaning industry information before applying this method for job classification. Our results highlight the need for standardized classification methods that remain constant over time.
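The abstract does not spell out how discordance was computed, so the following is only a minimal sketch of one plausible definition: a record counts as discordant when the major SOC group (the two digits before the hyphen) of the NIOCCS-assigned code differs from that of the reference (N0) code, and an unassigned code is treated as discordant. The function names, the handling of missing codes, and the toy records are all assumptions made for illustration and are not taken from the study.

```python
from collections import Counter


def major_group(soc_code):
    """Return the two-digit major group of a SOC code such as '47-2061'."""
    return soc_code.split("-")[0] if soc_code else None


def discordance_rates(reference, candidate):
    """Compare two aligned lists of SOC codes at the major-group level.

    Returns (overall_rate, per_group_rates), where groups are defined by
    the reference codes. Assumption: a missing candidate code (None)
    counts as discordant.
    """
    if len(reference) != len(candidate):
        raise ValueError("inputs must be aligned record by record")
    totals, mismatches = Counter(), Counter()
    for ref, cand in zip(reference, candidate):
        group = major_group(ref)
        totals[group] += 1
        if major_group(cand) != group:
            mismatches[group] += 1
    per_group = {g: mismatches[g] / n for g, n in totals.items()}
    overall = sum(mismatches.values()) / len(reference) if reference else 0.0
    return overall, per_group


# Toy records for illustration only (not data from the study).
n0 = ["47-2061", "47-2031", "51-4121", "53-3032"]  # reference codes
n1 = ["47-2061", "51-4121", "51-4121", None]       # hypothetical coder output
overall, per_group = discordance_rates(n0, n1)
print(f"overall discordance: {overall:.1%}")
for group, rate in sorted(per_group.items()):
    print(f"major group {group}: {rate:.1%}")
```

Grouping by the reference code keeps each per-group rate interpretable as the share of records in that N0 group that were assigned elsewhere; other denominators are equally defensible depending on the question asked.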