Applications of machine learning in earth sciences include geological mapping, gas leakage detection and geological feature identification. Machine learning (ML) is a type of artificial intelligence (AI) that enables computer systems to classify, cluster, identify and analyze vast and complex sets of data without the need for explicit instructions and programming (Mueller, J. P., & Massaron, L. (2021). Machine learning for dummies. John Wiley & Sons).

Earth science is the study of the origin, evolution, and future of the planet Earth. The Earth system can be subdivided into four major components: the solid earth, atmosphere, hydrosphere and biosphere. A variety of algorithms may be applied depending on the nature of the earth science exploration. Some algorithms may perform significantly better than others for particular objectives. For example, convolutional neural networks (CNN) are good at interpreting images, and artificial neural networks (ANN) perform well in soil classification but are more computationally expensive to train than support-vector machine (SVM) learning. The application of machine learning has become popular in recent decades, as the development of other technologies such as unmanned aerial vehicles (UAVs), ultra-high resolution remote sensing technology and high-performance computing units has led to the availability of large high-quality datasets and more advanced algorithms.

Significance

Complexity of earth science

Problems in earth science are often complex. It is difficult to apply well-known and well-described mathematical models to the natural environment, so machine learning is commonly a better alternative for such non-linear problems. Ecological data are commonly non-linear and involve higher-order interactions. Combined with missing data, this means traditional statistics may underperform, because unrealistic assumptions such as linearity are applied to the model. A number of researchers found that machine learning outperforms traditional statistical models in earth science, such as in characterizing forest canopy structure, predicting climate-induced range shifts, and delineating geologic facies. Characterizing forest canopy structure enables scientists to study vegetation response to climate change. Predicting climate-induced range shifts enables policy makers to adopt suitable conservation methods to overcome the consequences of climate change. Delineating geologic facies helps geologists to understand the geology of an area, which is essential for its development and management.

Inaccessible data

In earth sciences, some data are difficult to access or collect, so it is desirable to infer them by machine learning from data that are easily available. For example, geological mapping in tropical rainforests is challenging because of the thick vegetation cover and poorly exposed rock outcrops. Applying remote sensing with machine learning approaches provides an alternative way of mapping rapidly, without the need for manual mapping in unreachable areas.

Reduce time costs

Machine learning can also reduce the effort required from experts, as manual tasks such as classification and annotation are bottlenecks in the workflow of earth science research. Geological mapping with traditional methods, especially in a vast, remote area, is labour-, cost- and time-intensive. Incorporating remote sensing and machine learning approaches can provide an alternative solution that eliminates some field mapping needs.

Consistent and bias-free

Consistency and freedom from bias are also advantages of machine learning over manual work by humans. In research comparing the performance of humans and machine learning in the identification of dinoflagellates, machine learning was found to be less prone to systematic bias than humans. One bias present in humans is the recency effect, in which classification is biased towards the most recently recalled classes. In a labelling task in that research, if one kind of dinoflagellate occurred rarely in the samples, expert ecologists commonly did not classify it correctly. Such systematic bias strongly degrades the classification accuracy of humans.

Optimal machine learning algorithm

The extensive usage of machine learning in various fields has led to a wide range of learning algorithms being applied. The choice of machine learning algorithm for solving earth science problems is of much interest to researchers. Choosing the optimal algorithm for a specific purpose can lead to a significant boost in accuracy. For example, the lithological mapping of gold-bearing granite-greenstone rocks in Hutti, India with AVIRIS-NG hyperspectral data shows more than 10% difference in overall accuracy between Support Vector Machine (SVM) and random forest.

Some algorithms can also reveal important information. 'White-box' models are transparent models in which the results and methodologies can be easily explained, while 'black-box' models are the opposite. For example, although the support-vector machine (SVM) yielded the best result in landslide susceptibility assessment accuracy, the result cannot be rewritten in the form of expert rules that explain how and why an area was assigned to a specific class. In contrast, the decision tree has a transparent model that can be understood easily, and the user can observe and fix any bias present in the model.

If computational power is a concern, a more computationally demanding learning method such as an artificial neural network is less preferred, even though it may slightly outperform other algorithms, such as in soil classification.

Below are highlights of some commonly applied algorithms.

Support Vector Machine (SVM) (figure: SVM explain.png): The decision boundary is determined during training from the training dataset, represented in the figure by the green and red dots. The purple data point falls below the decision boundary and therefore belongs to the red class.

K nearest neighbor (figure: K nearest neighbour explain.png): K nearest neighbor classifies data based on their similarities. k is a parameter representing the number of neighbors considered in the voting process. In the figure k = 4, so the nearest 4 neighbors are considered. Of these, 3 belong to the red class and 1 belongs to the green class, so the purple data point is classified as red.

Decision Tree (figure: Decision Tree Explain.png): A decision tree shows the possible outcomes of related choices. Decision trees can be further divided into classification trees and regression trees. The figure shows a classification tree, as the outputs are discrete classes. For a regression tree, the output is a number. This is a white-box model that is transparent, so the user can spot any bias that appears in the model.

Random forest (figure: Random forest explain.png): In random forest, multiple decision trees are used together in an ensemble method. Multiple decision trees are produced during the training of a model, and different trees may give different results. A majority voting or averaging process gives the final result. This method yields higher accuracy than using a single decision tree.

Neural Networks (figure: Neural network explain.png): Neural networks mimic neurons in a biological brain. A network consists of multiple layers, where the layers in between are hidden layers. The weights of the connections are adjusted during the training process. As the logic in between is unclear, this is referred to as 'black-box operation'. The convolutional neural network (CNN) is a subclass of neural networks that is commonly used for processing images.
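The k-nearest-neighbor vote described above can be illustrated with a short sketch. The point coordinates, the function name `knn_classify` and the choice of Euclidean distance are assumptions made for illustration only; they are arranged so that, with k = 4, three of the nearest neighbors are red and one is green, as in the example in the text.

```python
from collections import Counter
import math

def knn_classify(query, samples, k):
    """Classify `query` by majority vote among its k nearest labelled samples."""
    # Sort labelled samples by Euclidean distance to the query point.
    by_distance = sorted(samples, key=lambda s: math.dist(query, s[0]))
    # Count the class labels of the k closest samples.
    votes = Counter(label for _, label in by_distance[:k])
    # Ties are resolved in favour of the label encountered first.
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training points (coordinates are made up for illustration).
training = [
    ((1.0, 1.0), "red"),
    ((1.5, 0.5), "red"),
    ((0.5, 1.5), "red"),
    ((2.0, 2.0), "green"),
    ((5.0, 5.0), "green"),
    ((6.0, 5.5), "green"),
]

# The four nearest neighbours of this query are three red points and one green point.
label = knn_classify((1.2, 1.2), training, k=4)
print(label)
```

In practice the choice of k and of the distance measure affects the result, so both are usually tuned on held-out data.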

Usage

Mapping

Geological or lithological mapping and mineral prospectivity mapping

Geological or lithological mapping produces maps showing geological features and geological units. Mineral prospectivity mapping uses a variety of datasets, such as geological maps and aeromagnetic imagery, to produce maps specialized for mineral exploration. Both kinds of mapping can be carried out by processing spectral imagery obtained from remote sensing, together with geophysical data, using machine-learning techniques. Spectral imagery records selected wavelength bands across the electromagnetic spectrum, whereas conventional imaging captures only three wavelength bands (red, green, blue).

Random Forest and Support Vector Machine (SVM) are common algorithms used with remotely sensed geophysical data, while Simple Linear Iterative Clustering-Convolutional Neural Network (SLIC-CNN) and Convolutional Neural Networks (CNN) are commonly applied to aerial photos and images. Large-scale mapping can be carried out with geophysical data from airborne and satellite remote sensing, and smaller-scale mapping can be carried out with images from Unmanned Aerial Vehicles (UAV) for higher resolution. Vegetation cover is one of the major obstacles for geological mapping with remote sensing, as reported in various studies of both large-scale and small-scale mapping. Vegetation affects the quality of spectral images or obscures the rock information in aerial images.

Figure: Landslide susceptibility mapping dataset splitting

Landslide susceptibility and hazard mapping

Landslide susceptibility refers to the probability of a landslide occurring at a place, which is affected by the local terrain conditions. Landslide susceptibility mapping can highlight areas prone to landslide risk, which is useful for urban planning and disaster management. The input dataset for machine learning algorithms usually includes topographic information, lithological information and satellite images, and some studies also include land use, land cover, drainage information and vegetation cover according to their needs. Training a machine learning model for landslide susceptibility mapping requires training and testing datasets. There are two methods of allocating data to them: one is to split the study area randomly between the two datasets; the other is to split the whole study area into two adjacent parts, one for each dataset. To test classification models, the common practice is to split the study area randomly. However, splitting the study area into two adjacent parts is more useful, because it reflects the practical case in which an automated algorithm maps a new area using expert-processed data from adjacent land.
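The two ways of allocating training and testing data can be contrasted in a minimal sketch. It assumes, purely for illustration, that the study area is a regular grid of cells identified by (x, y) indices; the function names, the 30% test fraction and the boundary position are hypothetical and not taken from any particular study.

```python
import random

def random_split(cells, test_fraction=0.3, seed=0):
    """Randomly assign grid cells to training and testing sets."""
    rng = random.Random(seed)
    shuffled = cells[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled cells become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

def adjacent_split(cells, boundary_x):
    """Split the study area into two adjacent parts along a vertical line."""
    train = [c for c in cells if c[0] < boundary_x]
    test = [c for c in cells if c[0] >= boundary_x]
    return train, test

# Hypothetical 10 x 10 grid of map cells identified by (x, y) indices.
cells = [(x, y) for x in range(10) for y in range(10)]

train_r, test_r = random_split(cells)
train_a, test_a = adjacent_split(cells, boundary_x=7)
print(len(train_r), len(test_r))
print(len(train_a), len(test_a))
```

With the random split, test cells are scattered among training cells; with the adjacent split, the test region is a contiguous block that the model has not seen, which is closer to mapping a new area.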

Feature identification and detection

Figure: Data augmentation of rock images

Discontinuity analyses

Discontinuities such as fault planes and bedding planes have important implications in engineering. Rock fractures can be recognized automatically by machine learning through photogrammetric analysis, even in the presence of interfering objects such as foliation and rod-shaped vegetation. When training a model to classify images, data augmentation is a common practice to avoid overfitting and to enlarge the training dataset. For example, in a study on recognizing rock fractures, 68 images for training and 23 images for testing were prepared by random splitting. Data augmentation was then carried out, and the training dataset was increased to 8704 images by flipping and random cropping. The approach was able to recognize the rock fractures accurately in most cases. The Negative Predictive Value (NPV) and the specificity were both over 0.99. This demonstrated the robustness of discontinuity analyses with machine learning.
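Augmentation by flipping and random cropping can be sketched as follows. This is an illustrative sketch only: the image is a small nested list with made-up pixel values, and the crop size, flip probability and function names are assumptions, not the settings of the study. Note that 8704 = 68 × 128, so 128 variants per original image would reproduce the reported dataset size if the variants were spread evenly, which the source does not state.

```python
import random

def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position inside the image."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def augment(image, n_variants, crop_h, crop_w, seed=0):
    """Produce n_variants randomly cropped, randomly flipped copies of one image."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        crop = random_crop(image, crop_h, crop_w, rng)
        # Flip roughly half of the crops (assumed probability of 0.5).
        if rng.random() < 0.5:
            crop = horizontal_flip(crop)
        variants.append(crop)
    return variants

# Hypothetical 6 x 8 grayscale "image" with made-up pixel values.
image = [[r * 8 + c for c in range(8)] for r in range(6)]
variants = augment(image, n_variants=128, crop_h=4, crop_w=4)
print(len(variants), len(variants[0]), len(variants[0][0]))
```

Augmentation should be applied to the training images only, after the train/test split, so that modified copies of a test image never leak into training.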

Carbon dioxide leakage detection

Quantifying carbon dioxide leakage from geologic sequestration sites has been gaining attention, as the public is interested in whether carbon dioxide is stored underground safely and effectively. Geologic sequestration captures greenhouse gases and buries them deep underground in geological formations. Carbon dioxide leakage from a geologic sequestration site can be detected indirectly through the plant stress response, with the aid of remote sensing and an unsupervised clustering algorithm (the Iterative Self-Organizing Data Analysis Technique, ISODATA). An increase in soil CO₂ concentration causes a stress response in plants by inhibiting plant respiration, as oxygen is displaced by carbon dioxide. The stress signal from the vegetation can be detected with the Red Edge Index (REI). The hyperspectral images are processed by the unsupervised algorithm, which clusters pixels with similar plant responses. The hyperspectral information in areas with known CO₂ leakage is extracted so that areas with CO₂ leakage can be matched with the clustered pixels showing spectral anomalies. Although the approach can identify CO₂ leakage efficiently, there are some limitations that require further study. The REI may be inaccurate for reasons such as higher chlorophyll absorption, variation in vegetation, and shadowing effects, so some stressed pixels were incorrectly identified as healthy. Seasonality and groundwater table height may also affect the stress response of the vegetation to CO₂.
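The idea of clustering pixels by a spectral index without labels can be sketched in a few lines. The sketch below uses plain k-means on scalar values as a simplified stand-in; it is not ISODATA, which additionally splits and merges clusters during iteration. The index values are made up, and which cluster corresponds to stressed vegetation would still have to be established from pixels with known leakage, as described above.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Cluster scalar values with plain k-means (a simplified stand-in, not ISODATA).

    Requires k >= 2 and iterations >= 1.
    """
    # Initialise centres evenly between the minimum and maximum value.
    lo, hi = min(values), max(values)
    centres = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        # Assign each value to its nearest centre.
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        # Move each centre to the mean of its members (unchanged if the cluster is empty).
        for j in range(k):
            members = [v for v, l in zip(values, labels) if l == j]
            if members:
                centres[j] = sum(members) / len(members)
    return labels, centres

# Hypothetical per-pixel index values (made up for illustration).
index_values = [0.82, 0.80, 0.85, 0.41, 0.38, 0.79, 0.44, 0.83]
labels, centres = kmeans_1d(index_values, k=2)
print(labels)
```

Real hyperspectral data would be clustered on many bands per pixel rather than a single scalar, but the assign-then-update loop is the same.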

Quantification of water inflow

The Rock Mass Rating (RMR) system is a geomechanical rock mass classification system adopted worldwide, which takes six parameters as input. The amount of water inflow is one of the inputs of the classification scheme, representing the groundwater condition. Quantification of the water inflow in the faces of a rock tunnel was traditionally carried out by visual observation in the field, which is labour-intensive and time-consuming and raises safety concerns. Machine learning can determine the water inflow by analyzing images taken at the construction site. The classification in the approach mostly follows the RMR system but combines the damp and wet states, as they are difficult to distinguish by visual inspection alone. The images were classified into the non-damage state, wet state, dripping state, flowing state and gushing state. The accuracy of classifying the images was about 90%.

Classification

Soil classification

The most popular cost-effective method for soil investigation is Cone Penetration Testing (CPT). The test is carried out by pushing a metallic cone through the soil, and the force required to push it at a constant rate is recorded as a quasi-continuous log. Machine learning can classify soil from Cone Penetration Test log data. Classifying with machine learning involves two tasks: segmentation and classification. Segmentation can be carried out with the Constraint Clustering and Classification (CONCC) algorithm, which splits a single data series into segments. Classification can be carried out with Decision Trees (DT), Artificial Neural Network (ANN), or Support Vector Machine (SVM). In a comparison of the three algorithms, the ANN performed best in classifying humous clay and peat, while the Decision Trees performed best in classifying clayey peat. Classification by this method can reach very high accuracy: even for the most complex problem the accuracy was 83%, and the incorrectly assigned class was a geologically neighbouring one. Since such accuracy is sufficient for most experts, the accuracy of the approach can be regarded as 100%.

Geological structure classification

Exposed geological structures such as anticlines, ripple marks, xenoliths, scratches, ptygmatic folds, faults, concretions, mudcracks, gneissose structures, boudins, basalt columns and dikes can be identified automatically with a deep learning model. Research demonstrated that a three-layer Convolutional Neural Network (CNN) and transfer learning achieve accuracies of about 80% and 90% respectively, while others such as K-nearest neighbors (KNN), Artificial Neural Network (ANN) and Extreme Gradient Boosting (XGBoost) have low accuracies, ranging from 10% to 30%. Both grayscale and colour images were tested, and the difference in accuracy was small, suggesting that colour is not very important in identifying geological structures.

Forecast and predictions

Earthquake early warning systems and forecasting

Earthquake early warning systems are often vulnerable to local impulsive noise and therefore give out false alerts. False alerts can be eliminated by discriminating earthquake waveforms from noise signals with the aid of machine learning methods. The method consists of two parts: unsupervised learning with a Generative Adversarial Network (GAN) to learn and extract features of first-arrival P-waves, and a Random Forest to discriminate P-waves. The approach achieved 99.2% in recognizing P-waves and can avoid false triggers by noise signals with 98.4% accuracy.

Laboratory earthquakes are produced in a laboratory setting to mimic real-world earthquakes. With the help of machine learning, patterns of acoustic signals that act as precursors to earthquakes can be identified without manual searching. Prediction of the time remaining before failure was demonstrated in a study using continuous acoustic time series data recorded from a fault. The algorithm applied was a Random Forest trained on about 10 slip events, and it performed excellently in predicting the remaining time to failure. It identified acoustic signals that predict failures, one of which was previously unidentified. Although laboratory earthquakes are not as complex as those in the Earth, this is important progress that can guide future earthquake prediction work.

Streamflow discharge prediction

Real-time streamflow data are integral to decision making, for example on evacuations or the regulation of reservoir water levels during a flooding event. Streamflow can be estimated from information provided by streamgages, which measure the water level of a river. However, water and debris from a flooding event may damage streamgages, so that essential real-time data go missing. The ability of machine learning to infer missing data enables it to predict streamflow from both historical streamgage data and real-time data. Streamflow Hydrology Estimate using Machine Learning (SHEM) is a model that can serve this purpose. To verify its accuracy, the prediction results were compared with the actual recorded data, and the accuracies were found to be between 0.78 and 0.99.

Challenges

Inadequate training data

An adequate amount of training and validation data is required for machine learning. However, some very useful products such as satellite remote sensing data only go back a few decades, to the 1970s. If one is interested in yearly data, fewer than 50 samples are available, which may not be adequate. In a study of automatic classification of geological structures, the weakness of the model was the small training dataset, even with data augmentation used to increase its size. Another study, on predicting streamflow, found that the accuracies depend on the availability of sufficient historical data, so sufficient training data determine the performance of machine learning. Inadequate training data may lead to a problem called overfitting. Overfitting causes inaccuracies in machine learning because the model learns the noise and undesired details.

Limited by data input

Machine learning cannot carry out some tasks that a human does easily. For example, in the quantification of water inflow in rock tunnel faces from images for the Rock Mass Rating (RMR) system, the damp and wet states were not separated by machine learning, because discriminating the two by visual inspection alone is not possible. In some tasks, machine learning may not be able to fully substitute for manual work by a human.

Black-box operation

Many machine learning algorithms, for example the Artificial Neural Network (ANN), are considered 'black-box' approaches because clear relationships and descriptions of how the results are generated in the hidden layers are unknown. A 'white-box' approach such as a decision tree can reveal the details of the algorithm to users. If one wants to investigate the relationships, 'black-box' approaches are not suitable. However, the performance of 'black-box' algorithms is usually better.
