Study on the performance of Robust LASSO in determining important variables in data with outliers

ROCHYATI ROCHYATI, KUSMAN SADIK, BAGUS SARTONO, EVITA PURNANINGRUM

Abstract


A variable selection method is required to deal with regression models that have many variables, and LASSO has been the most widely used method. However, as several authors have noted, LASSO is sensitive to outliers in the data. For this reason, the Robust LASSO approach was introduced, which applies a weighting scheme to each observation in the data. This research presents a comparative study of three weighting schemes in Robust LASSO, namely Huber-LASSO, Tukey-LASSO, and Welsch-LASSO. The study ran an extensive simulation covering many scenarios that varied the covariance structure of the explanatory variables, the types of outliers, the number of outliers, the location of active variables, and the number of variables. The study found that Tukey-LASSO outperformed Huber-LASSO and Welsch-LASSO in identifying significant variables. The performance of Robust LASSO generally decreased as the covariances among the explanatory variables increased and as the data dimension increased. Exploration of the sembung leaf extract data shows that the data are high-dimensional, with about 14.28% outliers in the response variable and about 25.71% in the explanatory variables. In this application, Tukey-LASSO selected nine compounds, Huber-LASSO and Welsch-LASSO each selected eight compounds, and LASSO selected 13 compounds. Tukey-LASSO also had better prediction accuracy than the other three methods.
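To make the three weighting schemes concrete, the sketch below gives the textbook Huber, Tukey (bisquare), and Welsch weight functions for a scaled residual. It is an illustration only, not the authors' implementation: the function names, the assumption that residuals are already divided by a robust scale estimate, and the tuning constants (1.345, 4.685, 2.985, the values commonly quoted for 95% efficiency under normal errors) are not taken from the abstract.

```python
import math

# Assumption: r is a residual already divided by a robust scale estimate.
# Assumption: tuning constants are the commonly quoted 95%-efficiency values,
# not values reported in the paper.

def huber_weight(r, k=1.345):
    """Weight 1 for small residuals, decaying as k/|r| beyond the threshold."""
    a = abs(r)
    return 1.0 if a <= k else k / a

def tukey_weight(r, c=4.685):
    """Bisquare weight: smooth decay that reaches exactly 0 for |r| > c."""
    u = r / c
    return (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def welsch_weight(r, c=2.985):
    """Exponential decay: always positive, but close to 0 for large |r|."""
    return math.exp(-((r / c) ** 2))

if __name__ == "__main__":
    # Compare how each scheme downweights increasingly extreme residuals.
    for r in (0.0, 1.0, 3.0, 6.0):
        print(r, huber_weight(r), tukey_weight(r), welsch_weight(r))
```

In a weighted LASSO fit, weights of this kind would typically be recomputed from the current residuals and passed back as observation weights, repeating until the coefficients stabilize. Note that only the Tukey weight removes a gross outlier entirely; the Huber and Welsch weights merely shrink its influence.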

Keywords


high dimensional regression, Huber, Tukey, variable selection, Welsch



DOI: 10.24815/jn.v23i1.26279
