On the Choice of Linear Regression Algorithms for Biological and Ecological Applications
Vasco M. N. C. S. Vieira *
MARETEC, Instituto Superior Técnico, Universidade Técnica de Lisboa, Av. Rovisco Pais, 1049-001, Lisboa, Portugal.
Joel Creed
Departamento de Ecologia, Instituto de Biologia Roberto Alcântara Gomes, Universidade do Estado do Rio de Janeiro, Rua São Francisco Xavier 524, 20.559-900, Rio de Janeiro, Brazil.
Ricardo A. Scrosati
Department of Biology, Saint Francis Xavier University, Antigonish, Nova Scotia B2G 2W5, Canada.
Anabela Santos
Universidade Autónoma de Lisboa, Rua de Santa Marta, nº 56 - 1169-023, Lisboa, Portugal.
Georg Dutschke
Universidade Autónoma de Lisboa, Rua de Santa Marta, nº 56 - 1169-023, Lisboa, Portugal.
Francisco Leitão
CCMAR, Center of Marine Science, University of Algarve, Campus Gambelas, 8005-139 Faro, Portugal.
Aschwin H. Engelen
CCMAR, Center of Marine Science, University of Algarve, Campus Gambelas, 8005-139 Faro, Portugal.
Oscar R. Huanel
Instituto de Ciencias Ambientales y Evolutivas, Facultad de Ciencias, Universidad Austral de Chile, Casilla 567, Valdivia, Chile.
Marie-Laure Guillemin
Instituto de Ciencias Ambientales y Evolutivas, Facultad de Ciencias, Universidad Austral de Chile, Casilla 567, Valdivia, Chile and CNRS, Sorbonne Universités, UPMC University Paris VI, UMI 3614, Evolutionary Biology and Ecology of Algae, Station Biologique de Roscoff, CS 90074, Place G. Tessier, 296888 Roscoff, France.
Marcos Mateus
MARETEC, Instituto Superior Técnico, Universidade Técnica de Lisboa, Av. Rovisco Pais, 1049-001, Lisboa, Portugal.
Ramiro Neves
MARETEC, Instituto Superior Técnico, Universidade Técnica de Lisboa, Av. Rovisco Pais, 1049-001, Lisboa, Portugal.
*Author to whom correspondence should be addressed.
Abstract
Model II regression (i.e. minimizing residuals obliquely) is the appropriate alternative to Model I regression by Ordinary Least Squares (i.e. minimizing residuals vertically) when dependence relationships are not well established or when x is measured with error. Yet, it has no perfect solution. Determining the true slope from errors-in-the-variables models requires the errors in x and y to be estimated from higher-order moments. However, their accurate estimation requires enormous data sets, so this approach is not applicable to most ecological problems. The alternative Reduced Major Axis (RMA) depends on a strict set of assumptions that are hardly ever met by real data, making it prone to bias, whereas Principal Components Analysis (PCA) becomes less reliable as correlations decrease while x and y present similar variances. We used artificial data (allowing for the determination of the true slope) to demonstrate when RMA or PCA should be preferred. Consequently, we propose using PCA whenever $r^2 + s_x^2/s_y^2$ is higher than 1.5. Otherwise, we suggest generating artificial data manipulated to match the structure of the original data and testing which method provides estimates closer to the input true slope. We provide a user-friendly script to perform this task. We tested RMA and PCA with real data on intraspecific and interspecific biomass-density relations in algae and seagrass, algae frond growth, crustacean and bird morphometry, sardine fisheries and social sciences, commonly finding widely divergent slope estimates that lead to severely biased parameter estimates and model applications. These analyses support the approach for method selection summarized above.
Keywords: Model II regression, Principal Components Analysis, Reduced Major Axis.
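As an illustration of the selection rule stated in the abstract, the following minimal Python sketch computes the OLS, RMA and PCA (major axis) slopes together with the criterion $r^2 + s_x^2/s_y^2$. It is not the script provided by the authors: the function names `regression_slopes` and `suggest_method` are hypothetical, the slope formulas are the standard textbook ones, and the criterion is read as the squared correlation plus the ratio of the sample variance of x to that of y, as written in the abstract. The simulation step recommended when the criterion is at or below 1.5 is not covered here.

```python
import numpy as np

def regression_slopes(x, y):
    """Return OLS, RMA and PCA (major axis) slopes plus the selection criterion.

    Assumes x and y have nonzero variance and nonzero covariance.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sxx = np.var(x, ddof=1)            # sample variance of x
    syy = np.var(y, ddof=1)            # sample variance of y
    sxy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance
    r = sxy / np.sqrt(sxx * syy)       # Pearson correlation

    ols = sxy / sxx                        # Model I: vertical residuals
    rma = np.sign(r) * np.sqrt(syy / sxx)  # Reduced Major Axis
    # Major axis: slope of the first principal component of the covariance matrix
    pca = (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)

    criterion = r ** 2 + sxx / syy     # r^2 + s_x^2 / s_y^2
    return {"ols": ols, "rma": rma, "pca": pca, "criterion": criterion}

def suggest_method(x, y, threshold=1.5):
    """Apply the abstract's rule: PCA above the threshold, otherwise simulate."""
    if regression_slopes(x, y)["criterion"] > threshold:
        return "PCA"
    return "compare RMA and PCA on artificial data"
```

For perfectly collinear data all three estimators coincide, so the sketch is only informative when x and y are scattered around the line; in that case the RMA and PCA slopes can diverge and the criterion indicates whether PCA can be used directly.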