Abstract
The paper presents an application of interpretable machine learning to identify groups of lakes that share not similar features but similar potential factors influencing the content of total phosphorus (Ptot). The method was developed on a sample of 60 lakes from North-Eastern Poland and used 25 external explanatory variables. The selected variables are stable over a long time: the first group includes morphometric parameters of the lakes, and the second group encompasses watershed geometry, geology, and land use. Our method involves building a regression model, creating an explainer, finding a set of mapping functions describing how each variable influences the outcome, and finally clustering objects by 'the influence'. The influence is a non-linear and non-parametric transformation of the explanatory variables into a form describing the impact of a given variable on the modeled feature. Such a transformation makes it possible to group data by the functional relations between the explanatory variables and the explained variable. The study reveals five clusters within which the concentration of Ptot is shaped similarly. We compared our method with other numerical analyses and showed that it provides new information on the relationship between the catchment area and lake trophy.
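The four steps named in the abstract (regression model, explainer, mapping functions, clustering by influence) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the data are synthetic, a k-nearest-neighbour regressor stands in for the regression model, a centred partial-dependence value stands in for the mapping functions, and a plain k-means stands in for the clustering step. The names knn_predict, influence_matrix and kmeans are hypothetical.

```python
import random
from statistics import mean

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def knn_predict(X, y, query, k=5):
    """Stand-in regression model: average y over the k nearest rows of X."""
    order = sorted(range(len(X)), key=lambda i: sq_dist(X[i], query))
    return mean(y[i] for i in order[:k])

def influence_matrix(X, y, k=5):
    """Stand-in explainer: map each x[i][j] to a centred partial-dependence value."""
    n, p = len(X), len(X[0])
    baseline = mean(knn_predict(X, y, row, k) for row in X)
    influence = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            # Give every object the value x[i][j] for variable j,
            # then average the model predictions over all objects.
            preds = [knn_predict(X, y, row[:j] + [X[i][j]] + row[j + 1:], k)
                     for row in X]
            influence[i][j] = mean(preds) - baseline
    return influence

def kmeans(points, n_clusters, n_iter=50, seed=0):
    """Plain k-means on the influence vectors (not on the raw variables)."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(points, n_clusters)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(n_clusters), key=lambda c: sq_dist(pt, centers[c]))
                  for pt in points]
        for c in range(n_clusters):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [mean(col) for col in zip(*members)]
    return labels

# Synthetic stand-in data: 40 objects, 3 explanatory variables.
# Variable 0 has a threshold effect, variable 1 a smooth effect, variable 2 none.
rng = random.Random(42)
X = [[rng.random() for _ in range(3)] for _ in range(40)]
y = [2.0 * (row[0] > 0.5) + row[1] ** 2 + rng.gauss(0, 0.05) for row in X]

influence = influence_matrix(X, y)      # objects x variables, in units of y
labels = kmeans(influence, n_clusters=3)
```

Because the clustering operates on the influence values, two objects with different raw values of a variable can land in the same cluster when that variable affects the modeled outcome in the same way for both.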
Funding
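This work was funded by the Polish National Science Center, grant no. 2016/23/D/ST10/03071, "Czego możemy nauczyć się od wioślarek (Cladocera)? Wykorzystanie zbioru testowego i nowoczesnych metod statystycznych do rekonstrukcji zmian środowiska" (What can we learn from water fleas (Cladocera)? Using a test set and modern statistical methods to reconstruct environmental changes).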