Predicting wellbore stability is essential for safe and efficient drilling operations. Accurate predictions help prevent costly and hazardous drilling issues. This tutorial will guide you through creating a wellbore stability prediction model using Python.
Today, you’ll learn how to preprocess data, handle class imbalance with SMOTE (Synthetic Minority Over-sampling Technique), engineer features, perform hyperparameter tuning with Grid Search, and build an ensemble model with a Stacking Classifier. Additionally, you’ll create a user-friendly GUI with tkinter for data input and result visualization. Let’s dive in and get started!
Table of Contents
- Necessary Libraries
- Imports
- Initializing the Main Window and Setting Up Its Elements
- Loading Data and Training the Model
- Predicting Wellbore Stability
- Showing the Confusion Matrix
- Showing the ROC Curve
- Showing the Classification Report
- Showing Feature Importance
- Creating GUI Components
- Example
- Full Code
Necessary Libraries
For the code to function properly, make sure to install these libraries by running the following commands. Note that imbalanced-learn is the correct package name (the old imblearn PyPI package is just a deprecated alias), and tkinter ships with most Python installations, so it normally needs no separate install:
$ pip install pandas
$ pip install scikit-learn
$ pip install imbalanced-learn
$ pip install matplotlib
$ pip install seaborn
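To confirm everything installed cleanly, a quick import check like this one (adjust python to your interpreter’s name) should print the message without errors:
$ python -c "import pandas, sklearn, imblearn, matplotlib, seaborn, tkinter; print('All good!')"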
Imports
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_curve, classification_report
from sklearn.feature_selection import SelectKBest, f_classif
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns
import tkinter as tk
from tkinter import filedialog, messagebox
If you have been following our blog, you are likely familiar with the tool assembly line we always use when discussing library imports. Just because it’s repetitive doesn’t mean it’s not useful. Let’s meet our imports:
- pandas: For manipulating and analyzing data, especially CSV files.
- train_test_split: To split CSV data into training and testing sets.
- GridSearchCV: To find the best parameters for our models.
- StandardScaler: To standardize features by removing the mean and scaling to unit variance.
- PolynomialFeatures: To generate polynomial and interaction features.
- GradientBoostingClassifier: A boosting ensemble that builds decision trees sequentially, each one correcting the errors of its predecessors.
- RandomForestClassifier: A bagging ensemble that averages the votes of many randomized decision trees.
- StackingClassifier: To improve prediction accuracy by combining multiple models.
- LogisticRegression: For binary classification; here it also serves as the final estimator of our stack.
- confusion_matrix: To compute the matrix of correct and incorrect predictions per class.
- precision_recall_curve: To plot the precision-recall curve.
- classification_report: To provide a detailed classification report.
- SelectKBest: To perform statistical tests and select the top k features.
- f_classif: The ANOVA F-test scoring function that SelectKBest uses to rank features.
- SMOTE: To handle imbalanced datasets by synthesizing new minority-class samples (see the quick sketch after this list).
- matplotlib.pyplot: To plot graphs.
- seaborn: To provide a high-level interface for drawing statistical graphics.
- tkinter: To create a graphical user interface, allowing file selection through filedialog and displaying messages with messagebox.
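If SMOTE is new to you, here is a minimal standalone sketch (on made-up toy data, not our wellbore CSV) of what it does to a skewed class distribution:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy dataset: roughly 90% of samples in class 0, 10% in class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=42)
print(Counter(y))  # e.g. Counter({0: 178, 1: 22})

# SMOTE synthesizes new minority-class samples until both classes are equal in size
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))  # e.g. Counter({0: 178, 1: 178})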
Initializing the Main Window and Setting Up Its Elements
Usually, we start by defining the functions and then create an interface to tie them together. This time, though, we break that pattern: we begin by creating the main window, setting its title, and defining its geometry.
Then, we create two empty lists, labels and entries. Next, we define a list of features, which are the inputs for our model. For each feature, we create a label to identify it and an entry widget to input its value. Finally, we add these labels and entries to their respective lists with labels.append(label) and entries.append(entry).
# Initialize the root Tkinter window
root = tk.Tk()
root.title("Wellbore Stability Prediction - The Pycodes")
root.geometry("400x300")
# Initialize empty lists for labels and entries
labels = []
entries = []
# Create labels and entry widgets for each feature
feature_names = ["Depth", "GammaRay", "Resistivity", "Density", "Sonic"]
for i, feature in enumerate(feature_names):
label = tk.Label(root, text=feature)
label.grid(row=i, column=0)
labels.append(label)
entry = tk.Entry(root)
entry.grid(row=i, column=1)
entries.append(entry)
Loading Data and Training the Model
Since we’re using machine learning to predict wellbore stability, we need to train our model. The load_data() function handles this.
First, it lets you pick a CSV file. Then, it reads the file into a DataFrame and separates the data into features and the target. We use SMOTE to balance the data by creating new synthetic samples for the minority class. Polynomial features are added to capture interaction effects, and we keep the most informative columns with SelectKBest, capping k at the number of columns the polynomial expansion actually produces (SelectKBest raises an error if k exceeds the available feature count).
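A quick sanity check on the feature counts here: with our five inputs, the degree-2 interaction-only expansion produces the 5 original columns plus the 10 pairwise products, for 15 columns in total, which is why k must be capped. A minimal sketch of that arithmetic:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
expanded = poly.fit_transform(np.zeros((1, 5)))  # five dummy input features
print(expanded.shape[1])  # 15 = 5 originals + 10 pairwise interaction terms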
Next, we split the data into training and testing sets and standardize the features. We initialize three models: GradientBoostingClassifier, RandomForestClassifier, and LogisticRegression. We tune the two tree-based models with GridSearchCV to find their best hyperparameters. Lastly, we use StackingClassifier to combine the tuned models, with LogisticRegression as the final estimator, for better accuracy. After training, a success message pops up.
def load_data():
global df, features, target, X_train, X_test, y_train, y_test, scaler, poly, selector, stacking_model, best_gb_model, best_rf_model
file_path = filedialog.askopenfilename(filetypes=[("CSV files", "*.csv")])
if not file_path:
return
df = pd.read_csv(file_path)
# Define the features and the target
features = df.drop(columns=['WellboreStability'])
target = df['WellboreStability']
# Handle class imbalance with SMOTE
smote = SMOTE(random_state=42)
features_res, target_res = smote.fit_resample(features, target)
# Feature Engineering: Adding Polynomial Features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
features_poly = poly.fit_transform(features_res)
# Feature Selection
    selector = SelectKBest(score_func=f_classif, k=min(20, features_poly.shape[1]))  # cap k at the available column count
features_selected = selector.fit_transform(features_poly, target_res)
# Now We Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_selected, target_res, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Initialize models
gb_model = GradientBoostingClassifier(random_state=42)
rf_model = RandomForestClassifier(random_state=42)
lr_model = LogisticRegression(random_state=42)
# Hyperparameter tuning using Grid Search
param_grid_gb = {
'n_estimators': [100, 200],
'learning_rate': [0.01, 0.1],
'max_depth': [3, 5]
}
param_grid_rf = {
'n_estimators': [100, 200],
'max_depth': [None, 10],
'min_samples_split': [2, 5]
}
grid_search_gb = GridSearchCV(gb_model, param_grid_gb, cv=5, scoring='accuracy')
grid_search_rf = GridSearchCV(rf_model, param_grid_rf, cv=5, scoring='accuracy')
grid_search_gb.fit(X_train, y_train)
grid_search_rf.fit(X_train, y_train)
best_gb_model = grid_search_gb.best_estimator_
best_rf_model = grid_search_rf.best_estimator_
# Ensemble model using Stacking Classifier
stacking_model = StackingClassifier(estimators=[
('gb', best_gb_model),
('rf', best_rf_model)
], final_estimator=lr_model, cv=5)
# Train the ensemble model
stacking_model.fit(X_train, y_train)
messagebox.showinfo("Data Load", "Data loaded and model trained successfully!")
Predicting Wellbore Stability
After training our model, the next step is making predictions, and that’s where the predict() function comes in. It starts by gathering the inputs from the user, converting them to floats, and storing them in input_data.
Next, it transforms these inputs into polynomial features and selects the same columns chosen during training. The data is then standardized to prepare it for prediction. Using the stacking_model, the function predicts whether the wellbore is stable or unstable (1 for stable, 0 for unstable, matching the check on prediction[0] in the code below).
Ultimately, it displays the prediction result in a message box. If something goes wrong, like invalid inputs or an unloaded model, an error message pops up.
def predict():
try:
input_data = [float(entry.get()) for entry in entries]
input_poly = poly.transform([input_data])
input_selected = selector.transform(input_poly)
input_scaled = scaler.transform(input_selected)
prediction = stacking_model.predict(input_scaled)
result = "Stable" if prediction[0] == 1 else "Unstable"
messagebox.showinfo("Prediction Result", f"This wellbore is predicted to be: {result}")
except ValueError:
messagebox.showerror("Input Error", "Please enter valid numeric values for all features.")
except NameError:
messagebox.showerror("Model Error", "Please load the data first and train the model.")
Showing the Confusion Matrix
def show_confusion_matrix():
if 'stacking_model' in globals():
y_pred = stacking_model.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 6))
        sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Unstable', 'Stable'], yticklabels=['Unstable', 'Stable'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for Wellbore Stability Prediction')
plt.show()
Time to see how well our model performs with the show_confusion_matrix() function. First, it checks whether stacking_model is defined globally. Then, it makes predictions on the test data and computes the confusion matrix.
To visualize it, we create a new figure with plt.figure() and use sns.heatmap() to draw the confusion matrix as a heatmap. Since class 0 is Unstable and class 1 is Stable, the tick labels on both axes are ordered Unstable, Stable. We set the axis labels, add a title, and finally display the plot with plt.show(). This gives us a clear picture of our model’s performance.
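If you prefer raw numbers to a heatmap, the four cells of a binary confusion matrix unpack directly. A sketch using the conf_matrix from the function above, remembering that class 0 is Unstable and class 1 is Stable:
tn, fp, fn, tp = conf_matrix.ravel()
print(f"Unstable correctly flagged: {tn}")
print(f"Unstable predicted Stable:  {fp}")
print(f"Stable predicted Unstable:  {fn}")
print(f"Stable correctly flagged:   {tp}")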
Showing the ROC Curve
def show_roc_curve():
if 'stacking_model' in globals():
y_pred_proba = stacking_model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, marker='.')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
Despite its name, the show_roc_curve() function actually plots a precision-recall curve, which is often more informative than a ROC curve on imbalanced problems like ours. It predicts class probabilities, computes precision-recall pairs for different threshold values, draws the curve with plt.plot(), labels the x and y axes, adds a title, and displays everything with plt.show(). This visual representation helps us understand how well our model distinguishes between classes across different thresholds.
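To boil the whole curve down to one headline number, average precision (the area under the precision-recall curve) is the usual summary. A sketch you could drop into the same function; note that average_precision_score is a standard scikit-learn metric but isn’t among the tutorial’s imports:
from sklearn.metrics import average_precision_score

ap = average_precision_score(y_test, y_pred_proba)  # 1.0 would be a perfect ranking
plt.title(f'Precision-Recall Curve (AP = {ap:.3f})')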
Showing the Classification Report
Now comes the final verdict on our model’s performance, handled by the show_classification_report() function. It generates a detailed report using classification_report() and then displays it in a message box. The report covers per-class precision, recall, and F1 scores along with overall accuracy, helping us understand how the model performs across different metrics.
def show_classification_report():
if 'stacking_model' in globals():
y_pred = stacking_model.predict(X_test)
report = classification_report(y_test, y_pred, target_names=['Unstable', 'Stable'])
messagebox.showinfo("Classification Report", report)
Showing Feature Importance
We finally meet the heroes of our story as the show_feature_importance() function reveals which features had the most impact on our model. It reads the importances from the tuned gradient-boosting model, recovers the names of the selected polynomial features, and draws them as a bar chart displayed with plt.show().
def show_feature_importance():
if 'stacking_model' in globals():
feature_importances = best_gb_model.feature_importances_
features_list = selector.get_feature_names_out(poly.get_feature_names_out(features.columns))
plt.figure(figsize=(12, 8))
sns.barplot(x=feature_importances, y=features_list)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Feature Importance in Predicting Wellbore Stability')
plt.show()
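With fifteen polynomial feature names on the y-axis, the chart can get crowded. One common tweak, sketched here with numpy (not otherwise imported in this tutorial), is to sort the importances and keep only the top ten:
import numpy as np

order = np.argsort(feature_importances)[::-1][:10]  # indices of the 10 largest importances
sns.barplot(x=feature_importances[order], y=[features_list[i] for i in order])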
Creating GUI Components
We’re almost done! Now, we add buttons to our graphical interface, each triggering a specific function. Here’s what we have:
- The “Load Data” button calls the load_data() function.
- The “Predict” button calls the predict() function.
- The “Show Confusion Matrix” button calls the show_confusion_matrix() function.
- The “Show ROC Curve” button calls the show_roc_curve() function.
- The “Show Classification Report” button calls the show_classification_report() function.
- The “Show Feature Importance” button calls the show_feature_importance() function.
# Create GUI components
load_button = tk.Button(root, text="Load Data", command=load_data)
load_button.grid(row=len(feature_names), column=0, columnspan=2)
predict_button = tk.Button(root, text="Predict", command=predict)
predict_button.grid(row=len(feature_names) + 1, column=0, columnspan=2)
conf_matrix_button = tk.Button(root, text="Show Confusion Matrix", command=show_confusion_matrix)
conf_matrix_button.grid(row=len(feature_names) + 2, column=0, columnspan=2)
roc_curve_button = tk.Button(root, text="Show ROC Curve", command=show_roc_curve)
roc_curve_button.grid(row=len(feature_names) + 3, column=0, columnspan=2)
classification_report_button = tk.Button(root, text="Show Classification Report", command=show_classification_report)
classification_report_button.grid(row=len(feature_names) + 4, column=0, columnspan=2)
feature_importance_button = tk.Button(root, text="Show Feature Importance", command=show_feature_importance)
feature_importance_button.grid(row=len(feature_names) + 5, column=0, columnspan=2)
Lastly, we call the mainloop() method to keep the main window running and responsive to user actions.
root.mainloop()
Example
Full Code
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_curve, classification_report
from sklearn.feature_selection import SelectKBest, f_classif
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns
import tkinter as tk
from tkinter import filedialog, messagebox
# Initialize the root Tkinter window
root = tk.Tk()
root.title("Wellbore Stability Prediction - The Pycodes")
root.geometry("400x300")
# Initialize empty lists for labels and entries
labels = []
entries = []
# Create labels and entry widgets for each feature
feature_names = ["Depth", "GammaRay", "Resistivity", "Density", "Sonic"]
for i, feature in enumerate(feature_names):
label = tk.Label(root, text=feature)
label.grid(row=i, column=0)
labels.append(label)
entry = tk.Entry(root)
entry.grid(row=i, column=1)
entries.append(entry)
def load_data():
global df, features, target, X_train, X_test, y_train, y_test, scaler, poly, selector, stacking_model, best_gb_model, best_rf_model
file_path = filedialog.askopenfilename(filetypes=[("CSV files", "*.csv")])
if not file_path:
return
df = pd.read_csv(file_path)
# Define the features and the target
features = df.drop(columns=['WellboreStability'])
target = df['WellboreStability']
# Handle class imbalance with SMOTE
smote = SMOTE(random_state=42)
features_res, target_res = smote.fit_resample(features, target)
# Feature Engineering: Adding Polynomial Features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
features_poly = poly.fit_transform(features_res)
# Feature Selection
    selector = SelectKBest(score_func=f_classif, k=min(20, features_poly.shape[1]))  # cap k at the available column count
features_selected = selector.fit_transform(features_poly, target_res)
# Now We Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_selected, target_res, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Initialize models
gb_model = GradientBoostingClassifier(random_state=42)
rf_model = RandomForestClassifier(random_state=42)
lr_model = LogisticRegression(random_state=42)
# Hyperparameter tuning using Grid Search
param_grid_gb = {
'n_estimators': [100, 200],
'learning_rate': [0.01, 0.1],
'max_depth': [3, 5]
}
param_grid_rf = {
'n_estimators': [100, 200],
'max_depth': [None, 10],
'min_samples_split': [2, 5]
}
grid_search_gb = GridSearchCV(gb_model, param_grid_gb, cv=5, scoring='accuracy')
grid_search_rf = GridSearchCV(rf_model, param_grid_rf, cv=5, scoring='accuracy')
grid_search_gb.fit(X_train, y_train)
grid_search_rf.fit(X_train, y_train)
best_gb_model = grid_search_gb.best_estimator_
best_rf_model = grid_search_rf.best_estimator_
# Ensemble model using Stacking Classifier
stacking_model = StackingClassifier(estimators=[
('gb', best_gb_model),
('rf', best_rf_model)
], final_estimator=lr_model, cv=5)
# Train the ensemble model
stacking_model.fit(X_train, y_train)
messagebox.showinfo("Data Load", "Data loaded and model trained successfully!")
def predict():
try:
input_data = [float(entry.get()) for entry in entries]
input_poly = poly.transform([input_data])
input_selected = selector.transform(input_poly)
input_scaled = scaler.transform(input_selected)
prediction = stacking_model.predict(input_scaled)
result = "Stable" if prediction[0] == 1 else "Unstable"
messagebox.showinfo("Prediction Result", f"This wellbore is predicted to be: {result}")
except ValueError:
messagebox.showerror("Input Error", "Please enter valid numeric values for all features.")
except NameError:
messagebox.showerror("Model Error", "Please load the data first and train the model.")
def show_confusion_matrix():
if 'stacking_model' in globals():
y_pred = stacking_model.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 6))
        sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Unstable', 'Stable'], yticklabels=['Unstable', 'Stable'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for Wellbore Stability Prediction')
plt.show()
def show_roc_curve():
if 'stacking_model' in globals():
y_pred_proba = stacking_model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, marker='.')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
def show_classification_report():
if 'stacking_model' in globals():
y_pred = stacking_model.predict(X_test)
report = classification_report(y_test, y_pred, target_names=['Unstable', 'Stable'])
messagebox.showinfo("Classification Report", report)
def show_feature_importance():
if 'stacking_model' in globals():
feature_importances = best_gb_model.feature_importances_
features_list = selector.get_feature_names_out(poly.get_feature_names_out(features.columns))
plt.figure(figsize=(12, 8))
sns.barplot(x=feature_importances, y=features_list)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Feature Importance in Predicting Wellbore Stability')
plt.show()
# Create GUI components
load_button = tk.Button(root, text="Load Data", command=load_data)
load_button.grid(row=len(feature_names), column=0, columnspan=2)
predict_button = tk.Button(root, text="Predict", command=predict)
predict_button.grid(row=len(feature_names) + 1, column=0, columnspan=2)
conf_matrix_button = tk.Button(root, text="Show Confusion Matrix", command=show_confusion_matrix)
conf_matrix_button.grid(row=len(feature_names) + 2, column=0, columnspan=2)
roc_curve_button = tk.Button(root, text="Show ROC Curve", command=show_roc_curve)
roc_curve_button.grid(row=len(feature_names) + 3, column=0, columnspan=2)
classification_report_button = tk.Button(root, text="Show Classification Report", command=show_classification_report)
classification_report_button.grid(row=len(feature_names) + 4, column=0, columnspan=2)
feature_importance_button = tk.Button(root, text="Show Feature Importance", command=show_feature_importance)
feature_importance_button.grid(row=len(feature_names) + 5, column=0, columnspan=2)
root.mainloop()
Happy Coding!