How to Predict Oil Reservoir Properties Using Machine Learning in Python

Predicting oil reservoir properties is a crucial task in the oil and gas industry, enabling engineers and geoscientists to make informed decisions about exploration and production. With the advent of machine learning, this task has become more efficient and accurate. In this tutorial, we will explore how to leverage machine learning techniques in Python to predict oil reservoir properties.

We will walk you through the process of loading and preprocessing data, building and training machine learning models, and making predictions using Linear Regression and Random Forest algorithms. Additionally, we will create a user-friendly graphical interface using the tkinter library to facilitate data upload, model training, and property prediction. By the end of this tutorial, you’ll have a practical understanding of how to apply machine learning to solve real-world problems in the oil and gas sector. Let’s dive in!

Purpose and Benefits of the Code for Petroleum Engineers

Our goal in today’s article is to build a practical script that streamlines how petroleum engineers predict reservoir properties. The script loads well-log data from a CSV file containing crucial measurements such as Gamma Ray (GR), Deep Resistivity (ILD), Porosity (PHI), Bulk Density (RHOB), Neutron Porosity (NPHI), and True Resistivity (RT), along with a target property we aim to predict, such as Permeability or Saturation.

Here’s how this code can help a petroleum engineer:

Data Handling and Preparation

The code begins by loading a CSV file containing well-log data and the target property, essential for understanding subsurface conditions. Key features include:

  • Gamma Ray (GR): Measures the natural radioactivity of the formation, helping identify rock types.
  • Deep Resistivity (ILD): Indicates the presence of hydrocarbons versus water.
  • Porosity (PHI): Shows the percentage of the rock’s volume that can store fluids.
  • Bulk Density (RHOB): Provides information about the density of the formation.
  • Neutron Porosity (NPHI): Measures the hydrogen content, which correlates with porosity.
  • True Resistivity (RT): Another measure to differentiate between hydrocarbon and water zones.

Model Training and Evaluation

The script trains two machine learning models, Linear Regression and Random Forest, using this well-log data to predict a target property. Potential target properties include:

  • Permeability: Indicates the ability of the reservoir rock to transmit fluids, crucial for understanding how easily oil can flow.
  • Saturation: Measures the proportion of pore volume occupied by oil, gas, or water, helping determine the volume of recoverable hydrocarbons.

To ensure reliability, the script evaluates the models using Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared (R²) score, providing insights into the models’ accuracy.
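
For reference, with $y_i$ the observed values, $\hat{y}_i$ the model’s predictions, and $\bar{y}$ the mean of the observations over $n$ test samples, the three metrics are defined as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert,\qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2,\qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

Lower MAE and MSE indicate smaller prediction errors, while an R² score closer to 1 means the model explains more of the variance in the target property.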

Practical Application

After training and evaluation, the script can predict the target property for new input data. Petroleum engineers can input well-log data (GR, ILD, PHI, RHOB, NPHI, RT) into the graphical interface, and the script will output the predicted property. This capability allows engineers to make informed decisions about:

  • Reservoir Quality: Assessing whether a formation can produce hydrocarbons economically.
  • Production Planning: Estimating recoverable oil volumes for efficient resource management.
  • Risk Mitigation: Reducing uncertainty in subsurface evaluations, leading to better investment decisions and reduced financial risk.

In summary, this code provides petroleum engineers with a robust tool to predict vital reservoir properties accurately and efficiently. Leveraging machine learning, engineers can enhance their understanding of the subsurface, optimize production strategies, and make more informed decisions in exploration and production activities.

Getting Started

Make sure to install these libraries for the code to function properly:

$ pip install numpy
$ pip install pandas
$ pip install matplotlib
$ pip install seaborn
$ pip install scikit-learn

Note that tkinter ships with the standard Python installers for Windows and macOS, so it usually needs no separate pip install; on some Linux distributions, you may need to add it through the system package manager (for example, sudo apt-get install python3-tk).

Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tkinter import *
from tkinter import filedialog, messagebox
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

As you may already know, every mission or adventure needs its own soldiers. So, let’s meet the ones who are going to accomplish today’s mission:

  • numpy: Handles arrays and matrices, as well as mathematical functions.
  • pandas: Manipulates data and stores it in table form.
  • matplotlib: Creates static and interactive plots.
  • seaborn: Provides a high-level interface for drawing attractive and informative statistical graphics.
  • tkinter: Creates a graphical user interface, accesses directories using filedialog, and displays pop-up messages with messagebox.
  • scikit-learn: A machine learning library (see the short sketch after this list) that:
    • Splits data into training and testing sets using train_test_split().
    • Implements Linear Regression and Random Forest algorithms with LinearRegression and RandomForestRegressor.
    • Evaluates machine learning models using mean_absolute_error, mean_squared_error, and r2_score.

Global Variables

Now that we have assembled our libraries, we need to gather trusted companions for this mission that we can call upon anytime we want. So, without further ado, let us meet them:

  • data: A variable to store our dataset after it’s loaded.
  • X: Since we need to train our model, we need input data, which we store in this variable.
  • y: To predict a target variable, we create this variable to store the output.
  • X_train and X_test: Variables to store the input data for training and testing, respectively.
  • y_train and y_test: Variables to store the output data for training and testing, respectively.
  • lr_model and rf_model: Variables for the Linear Regression and Random Forest models.

# Global variables
data = None
X = None
y = None
X_train = None
X_test = None
y_train = None
y_test = None
lr_model = None
rf_model = None

Loading and Uploading Data

Here comes the next step in our code: to read and unveil the mysteries hidden in our CSV file. We achieve this with the load_data() function, which reads the CSV file at the given path using pandas and loads it into a DataFrame. This DataFrame is then returned to be used by the rest of the code.

But before we can load the CSV file, we need to upload it. That’s where the upload_file() function comes in. It opens a file dialog that only shows CSV files. Once you select a valid file, this function assigns it to our global variable data and celebrates the success with a message box. After that, it calls the show_data() function to display the first few rows of the dataset and the preprocess_data() function to handle any missing values. If anything goes wrong during the upload, an error message will pop up to let you know.

# Function to load the dataset
def load_data(file):
    data = pd.read_csv(file)
    return data


# Function to upload a file
def upload_file():
    file = filedialog.askopenfilename(filetypes=[("CSV files", "*.csv")])
    if file:
        global data
        data = load_data(file)
        messagebox.showinfo("File Upload", "File uploaded successfully")
        show_data()
        preprocess_data()
    else:
        messagebox.showerror("File Upload Error", "Please upload a valid CSV file")

Data Preprocessing and Visualization Functions

Just as we mentioned earlier, the previous function calls both the preprocess_data() and show_data() functions. So, it’s only natural that we dive into what each of them does:

  • First up is the preprocess_data() function. Think of it as our data cleaner. It uses data.ffill() to fill in any missing values with the last known value from above in the column, ensuring our dataset is complete. Once done, it updates our global variable data with this preprocessed version.
  • Next, we have the show_data() function. Imagine you’re opening a new window to take a quick peek at the data. This function does exactly that. It creates a new window using Toplevel(). Inside this new window, a text widget is created with Text(). Then, it inserts the first few rows of our dataset into this text widget using text.insert(), and finally, arranges the text widget neatly in the window using text.pack().

# Function to preprocess data
def preprocess_data():
    global data
    data = data.ffill()


# Function to show data
def show_data():
    if data is not None:
        top = Toplevel()
        text = Text(top)
        text.insert(INSERT, str(data.head()))
        text.pack()
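
If forward-filling is new to you, here is a quick illustration on a toy pandas Series; the values are invented purely for illustration:

import pandas as pd

# ffill() copies the last known value downward into the gaps
s = pd.Series([2.45, None, None, 2.61])
print(s.ffill().tolist())   # [2.45, 2.45, 2.45, 2.61]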

Heatmap Visualization and Data Splitting Functions

Wondering about the hidden relationships in our data? The show_heatmap() function has you covered. It calculates the correlations between the numeric columns using data.corr() and then creates a colorful, annotated heatmap with sns.heatmap(). Finally, it displays this visual feast with plt.show(). If no data has been loaded yet, you’ll see a message box letting you know.

Now that we’ve got our data uploaded and loaded, it’s time to train and test our models. But first, we need to split the dataset, and the split_data() function handles this. It looks for a column named target_property (the one we want to predict). If it finds it, it drops this column to isolate the features (stored in X). Then, it splits these features and the target values (y) into training and testing sets using an 80-20 split with the train_test_split() function. And of course, if no data has been uploaded yet or the function can’t find the target_property column, it will display an error message to help you troubleshoot.

# Function to show correlation heatmap
def show_heatmap():
    if data is not None:
        plt.figure(figsize=(10, 8))
        # numeric_only=True keeps any non-numeric columns from breaking corr() on recent pandas
        sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
        plt.show()
    else:
        messagebox.showerror("Data Error", "Please upload a valid CSV file first")


# Function to split data
def split_data():
    global X, y, X_train, X_test, y_train, y_test
    if data is None:
        messagebox.showerror("Data Error", "Please upload a valid CSV file first")
        return
    if 'target_property' in data.columns:
        X = data.drop('target_property', axis=1)
        y = data['target_property']
        # 80-20 split with a fixed random_state for reproducibility
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        messagebox.showinfo("Data Split", "Data split into training and testing sets")
    else:
        messagebox.showerror("Data Error", "The CSV file must contain a 'target_property' column")

Model Training and Testing Functions

With our data successfully split, the next step is to train and evaluate our models. Let’s explore how our two key functions handle this: build_models() and evaluate_models():

  • First, the build_models() function. It sets up our models using LinearRegression and RandomForestRegressor. Then, it trains these models with our training data using the fit() method. If the training data isn’t available (from the split_data() function), it will show an error message. Once the models are trained, it calls our second function, evaluate_models().
  • Now, onto evaluate_models(). This function takes the testing data to generate predictions using the predict() method. It then evaluates these predictions with mean_absolute_error(), mean_squared_error(), and r2_score(). Finally, it formats the results into easy-to-read strings and displays them in a message box.

# Function to build models
def build_models():
    global lr_model, rf_model, X_train, y_train
    if X_train is not None and y_train is not None:
        lr_model = LinearRegression()
        rf_model = RandomForestRegressor()
        lr_model.fit(X_train, y_train)
        rf_model.fit(X_train, y_train)
        messagebox.showinfo("Model Training", "Models trained successfully")
        evaluate_models()
    else:
        messagebox.showerror("Model Training Error", "Please split the data first")


# Function to evaluate models
def evaluate_models():
    global X_test, y_test, lr_model, rf_model
    lr_predictions = lr_model.predict(X_test)
    rf_predictions = rf_model.predict(X_test)
    lr_mae = mean_absolute_error(y_test, lr_predictions)
    rf_mae = mean_absolute_error(y_test, rf_predictions)
    lr_mse = mean_squared_error(y_test, lr_predictions)
    rf_mse = mean_squared_error(y_test, rf_predictions)
    lr_r2 = r2_score(y_test, lr_predictions)
    rf_r2 = r2_score(y_test, rf_predictions)
    results = f"""
    Linear Regression MAE: {lr_mae:.4f}
    Random Forest MAE: {rf_mae:.4f}
    Linear Regression MSE: {lr_mse:.4f}
    Random Forest MSE: {rf_mse:.4f}
    Linear Regression R2 Score: {lr_r2:.4f}
    Random Forest R2 Score: {rf_r2:.4f}
    """
    messagebox.showinfo("Model Evaluation", results)

Property Prediction Function

Now that the models are trained and tested, we can move on to the main goal of our script: making predictions from new input data. How do we accomplish this? By using the predict_property() function.

Here’s how it works:

  • The function first checks that the models have been trained and the data loaded; if not, it shows an error message and stops.
  • It retrieves the user’s input from the entry widget using entry.get().
  • It then transforms this comma-separated input into a NumPy array using np.array().
  • Next, it checks whether the input array has the correct number of values (matching the number of feature columns in the uploaded CSV file, i.e., everything except target_property). If the input is incorrect, an error message is displayed.
  • Finally, using the predict() method, the function employs the Random Forest model to predict the target_property (whether that is Permeability, Saturation, etc.).

# Function to predict reservoir property
def predict_property():
    global X
    if rf_model is None or X is None:
        messagebox.showerror("Model Error", "Please train the models first")
        return
    user_input = entry.get()
    try:
        input_array = np.array([float(i) for i in user_input.split(',')]).reshape(1, -1)
        if input_array.shape[1] != X.shape[1]:
            messagebox.showerror("Input Error", "Incorrect number of inputs. Please enter data for all features: GR, ILD, PHI, RHOB, NPHI, RT.")
            return
        prediction = rf_model.predict(input_array)
        messagebox.showinfo("Prediction", f"Predicted Property: {prediction[0]:.4f}")
    except Exception as e:
        messagebox.showerror("Input Error", f"Error in input data: {e}")

Main Window Setup

We have finally reached the grand finale, where we bring all the previous elements together into the graphical interface. First, we create the main window, set its title, and define its geometry. Then, we create buttons, each calling a specific function:

  • The “Upload CSV File” button calls the upload_file() function.
  • The “Show Correlation Heatmap” button calls the show_heatmap() function.
  • The “Split Data” button calls the split_data() function.
  • The “Train Models” button calls the build_models() function.
  • The “Predict” button calls the predict_property() function.
  • The “Show Data” button calls the show_data() function.

Next, we create a label to inform the user what to input, and an entry widget for the input.

# Main window setup
root = Tk()
root.title("Predict Oil Reservoir Properties Using Machine Learning - The Pycodes")
root.geometry("600x400")


upload_btn = Button(root, text="Upload CSV File", command=upload_file)
upload_btn.pack(pady=10)


heatmap_btn = Button(root, text="Show Correlation Heatmap", command=show_heatmap)
heatmap_btn.pack(pady=10)


split_btn = Button(root, text="Split Data", command=split_data)
split_btn.pack(pady=10)


train_btn = Button(root, text="Train Models", command=build_models)
train_btn.pack(pady=10)


label = Label(root, text="Enter well log data (comma-separated): GR, ILD, PHI, RHOB, NPHI, RT")
label.pack(pady=10)


entry = Entry(root, width=50)
entry.pack(pady=10)


predict_btn = Button(root, text="Predict", command=predict_property)
predict_btn.pack(pady=10)


show_data_btn = Button(root, text="Show Data", command=show_data)
show_data_btn.pack(pady=10)

Finally, we start the main event loop with the mainloop() method, which keeps the main window running and responsive to the user.

root.mainloop()

Example

So, in simple terms, the CSV file consists of features (X) and a target property (y). Our goal with this script is to train models that predict the correct y from X as input. Here, X represents the features GR, ILD, PHI, RHOB, NPHI, and RT, and y is the target_property column of the CSV file.

Once we have a model that accurately predicts y from X in the CSV file, we can use that model to predict the target property for new input data.
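
To make the expected file layout concrete, here is a hypothetical CSV snippet; the values are invented purely for illustration, and your real well-log data will differ:

GR,ILD,PHI,RHOB,NPHI,RT,target_property
65.2,12.4,0.18,2.45,0.21,15.3,120.5
88.7,3.1,0.09,2.61,0.12,4.8,10.2
42.5,25.0,0.24,2.31,0.27,30.1,350.7

Every column except target_property becomes a feature; target_property holds whatever quantity you want the models to learn, such as permeability or saturation.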

Full Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tkinter import *
from tkinter import filedialog, messagebox
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


# Global variables
data = None
X = None
y = None
X_train = None
X_test = None
y_train = None
y_test = None
lr_model = None
rf_model = None


# Function to load the dataset
def load_data(file):
    data = pd.read_csv(file)
    return data


# Function to upload a file
def upload_file():
    file = filedialog.askopenfilename(filetypes=[("CSV files", "*.csv")])
    if file:
        global data
        data = load_data(file)
        messagebox.showinfo("File Upload", "File uploaded successfully")
        show_data()
        preprocess_data()
    else:
        messagebox.showerror("File Upload Error", "Please upload a valid CSV file")


# Function to preprocess data
def preprocess_data():
    global data
    data = data.ffill()


# Function to show data
def show_data():
    if data is not None:
        top = Toplevel()
        text = Text(top)
        text.insert(INSERT, str(data.head()))
        text.pack()


# Function to show correlation heatmap
def show_heatmap():
    if data is not None:
        plt.figure(figsize=(10, 8))
        # numeric_only=True keeps any non-numeric columns from breaking corr() on recent pandas
        sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
        plt.show()
    else:
        messagebox.showerror("Data Error", "Please upload a valid CSV file first")


# Function to split data
def split_data():
    global X, y, X_train, X_test, y_train, y_test
    if data is None:
        messagebox.showerror("Data Error", "Please upload a valid CSV file first")
        return
    if 'target_property' in data.columns:
        X = data.drop('target_property', axis=1)
        y = data['target_property']
        # 80-20 split with a fixed random_state for reproducibility
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        messagebox.showinfo("Data Split", "Data split into training and testing sets")
    else:
        messagebox.showerror("Data Error", "The CSV file must contain a 'target_property' column")


# Function to build models
def build_models():
    global lr_model, rf_model, X_train, y_train
    if X_train is not None and y_train is not None:
        lr_model = LinearRegression()
        rf_model = RandomForestRegressor()
        lr_model.fit(X_train, y_train)
        rf_model.fit(X_train, y_train)
        messagebox.showinfo("Model Training", "Models trained successfully")
        evaluate_models()
    else:
        messagebox.showerror("Model Training Error", "Please split the data first")


# Function to evaluate models
def evaluate_models():
    global X_test, y_test, lr_model, rf_model
    lr_predictions = lr_model.predict(X_test)
    rf_predictions = rf_model.predict(X_test)
    lr_mae = mean_absolute_error(y_test, lr_predictions)
    rf_mae = mean_absolute_error(y_test, rf_predictions)
    lr_mse = mean_squared_error(y_test, lr_predictions)
    rf_mse = mean_squared_error(y_test, rf_predictions)
    lr_r2 = r2_score(y_test, lr_predictions)
    rf_r2 = r2_score(y_test, rf_predictions)
    results = f"""
    Linear Regression MAE: {lr_mae:.4f}
    Random Forest MAE: {rf_mae:.4f}
    Linear Regression MSE: {lr_mse:.4f}
    Random Forest MSE: {rf_mse:.4f}
    Linear Regression R2 Score: {lr_r2:.4f}
    Random Forest R2 Score: {rf_r2:.4f}
    """
    messagebox.showinfo("Model Evaluation", results)


# Function to predict reservoir property
def predict_property():
    global X
    if rf_model is None or X is None:
        messagebox.showerror("Model Error", "Please train the models first")
        return
    user_input = entry.get()
    try:
        input_array = np.array([float(i) for i in user_input.split(',')]).reshape(1, -1)
        if input_array.shape[1] != X.shape[1]:
            messagebox.showerror("Input Error", "Incorrect number of inputs. Please enter data for all features: GR, ILD, PHI, RHOB, NPHI, RT.")
            return
        prediction = rf_model.predict(input_array)
        messagebox.showinfo("Prediction", f"Predicted Property: {prediction[0]:.4f}")
    except Exception as e:
        messagebox.showerror("Input Error", f"Error in input data: {e}")


# Main window setup
root = Tk()
root.title("Predict Oil Reservoir Properties Using Machine Learning - The Pycodes")
root.geometry("600x400")


upload_btn = Button(root, text="Upload CSV File", command=upload_file)
upload_btn.pack(pady=10)


heatmap_btn = Button(root, text="Show Correlation Heatmap", command=show_heatmap)
heatmap_btn.pack(pady=10)


split_btn = Button(root, text="Split Data", command=split_data)
split_btn.pack(pady=10)


train_btn = Button(root, text="Train Models", command=build_models)
train_btn.pack(pady=10)


label = Label(root, text="Enter well log data (comma-separated): GR, ILD, PHI, RHOB, NPHI, RT")
label.pack(pady=10)


entry = Entry(root, width=50)
entry.pack(pady=10)


predict_btn = Button(root, text="Predict", command=predict_property)
predict_btn.pack(pady=10)


show_data_btn = Button(root, text="Show Data", command=show_data)
show_data_btn.pack(pady=10)


root.mainloop()

Happy Coding!
