Data extraction and spatial / 3D representation from BGS borehole data in AGS format with Python

Open lithology datasets are scarce, and they often come in exchange formats that can't be easily coupled with other software. Python has the tools to read and extract data from those files and deliver them in friendlier formats such as CSV or XLSX. With spatial libraries such as Fiona, or 3D visualization libraries such as PyVista, we can go one step further and process our data as shapefiles or VTK files.
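AGS files are plain text, with quoted, comma-separated rows grouped under headers. A minimal illustration of the layout this tutorial parses, using an invented fragment (not taken from the actual Queen_Mary_Reservoir file):

```python
# a tiny invented AGS-style fragment (not the actual Queen_Mary_Reservoir file)
agsText = '''"GROUP","GEOL"
"HEADING","LOCA_ID","GEOL_TOP","GEOL_BASE","GEOL_DESC"
"DATA","572547","0.00","1.22","Soil"
"DATA","572547","1.22","2.44","Dirty ballast"
'''
#strip the quotes and split each line into fields
rows = [line.replace('"', '').split(',') for line in agsText.splitlines()]
#keep only the records, identified by the DATA keyword
dataRows = [r for r in rows if r[0] == 'DATA']
```

Each group starts with a "GROUP" row, followed by a "HEADING" row naming the columns and "DATA" rows with the records; the code below locates those groups by their line intervals.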

For this tutorial you have to create a geospatial conda environment following the steps in this link.

Tutorial

Python code

The code is divided into two parts:


Data formatting and spatial representation

import pandas as pd
import numpy as np
import re
import fiona
#read all lines from the AGS file
agsData = open('../inputData/Queen_Mary_Reservoir.ags').readlines()

#empty dict to store the line groups
groupDict = {}

for index, line in enumerate(agsData):
    if line.startswith('"GROUP"'):
        #split the group header into its terms
        terms = line.split(",")
        #find the blank line that closes the group
        for index2, line2 in enumerate(agsData[index:]):
            if line2 == "\n":
                #store the inclusive interval of the group's content lines
                lineList = [index+1, index+index2-1]
                print(lineList)
                #add the interval to the dict, omitting the " and \n from the group name
                groupDict[terms[1][1:-2]]=lineList
                #break the inner for
                break
[1, 4]
[7, 19]
[22, 27]
[30, 33]
[36, 47]
[50, 55]
[58, 1164]
[1167, 1436]
[1439, 1689]
#show line intervals
groupDict
{'PROJ': [1, 4],
 'ABBR': [7, 19],
 'DICT': [22, 27],
 'TRAN': [30, 33],
 'TYPE': [36, 47],
 'UNIT': [50, 55],
 'GEOL': [58, 1164],
 'LOCA': [1167, 1436],
 'WSTG': [1439, 1689]}
#create empty dataframe for geological information
geolDf = pd.DataFrame(columns=['LOCA_ID', 'GEOL_TOP', 'GEOL_BASE', 'GEOL_DESC'])
#counter for the dataframe index
i = 0
#interpret the numerical data as numbers
#the stored interval is inclusive, so add 1 to keep the group's last line
for line in agsData[groupDict['GEOL'][0]:groupDict['GEOL'][1]+1]:
    #get rid of the " 
    line = re.sub('"','',line)
    #split the line in columns
    lineItems = line.split(",")
    #working only with the "DATA" rows
    if lineItems[0] == 'DATA':
        #interpret numbers as float or int
        for index,item in enumerate(lineItems):
            try:
                if "." in item:
                    lineItems[index]=float(item)
                else:
                    lineItems[index]=int(item)
            except ValueError:
                pass
        #add first four columns to dataframe
        geolDf.loc[i] = lineItems[1:5]
        #add counter
        i += 1
#review the resulting dataframe
geolDf.head()
LOCA_ID GEOL_TOP GEOL_BASE GEOL_DESC
0 572547 0.00 1.22 Soil
1 572547 1.22 2.44 Dirty ballast
2 572547 2.44 8.84 Clean yellow sandy ballast
3 572547 8.84 11.89 London Clay (slightly sandy)
4 572548 0.00 0.30 Soil
#change id to integer
geolDf['LOCA_ID'] = pd.to_numeric(geolDf['LOCA_ID'], downcast="integer")

#check datatypes
geolDf.dtypes
LOCA_ID        int32
GEOL_TOP     float64
GEOL_BASE    float64
GEOL_DESC     object
dtype: object
#get unique values of lithology
litoArray = geolDf["GEOL_DESC"].unique()
litoArray
array(['Soil', 'Dirty ballast', 'Clean yellow sandy ballast',
       'London Clay (slightly sandy)', 'London Clay (good)',
       'Yellow loam', 'Dirty loamy ballast', 'Loam (yellow)',
       'Running sand (blue)', 'London Clay (sandy)', 'Blue running sand',
       'London Clay (very sandy)', 'Blue running sand and silt',
       'London Clay (very sandy). Stood to bore without driving tubing',
       'Loam', 'London Clay (rather sandy)', 'Clean grey sandy ballast',
       'Clean yellow sand', 'Clean yellow sandy', 'Black loam', 'Ballast',
       'Clay', 'Running sand', 'Clean sandy ballast',
       'Grey Thames ballast', 'Grey sandy silt', 'Hard blue sand',
       'Clean Thames ballast', 'Blue clay', 'Clean red sandy ballast',
       'Loamy ballast', 'Clean grey ballast', 'Brown loam'], dtype=object)
for lito in litoArray:
    print('%s has %d records'%
          (lito,geolDf[geolDf["GEOL_DESC"]==lito].shape[0]))
Soil has 224 records
Dirty ballast has 219 records
Clean yellow sandy ballast has 220 records
London Clay (slightly sandy) has 52 records
London Clay (good) has 142 records
Yellow loam has 48 records
Dirty loamy ballast has 2 records
Loam (yellow) has 11 records
Running sand (blue) has 1 records
London Clay (sandy) has 9 records
Blue running sand has 36 records
London Clay (very sandy) has 33 records
Blue running sand and silt has 1 records
London Clay (very sandy). Stood to bore without driving tubing has 1 records
Loam has 3 records
London Clay (rather sandy) has 13 records
Clean grey sandy ballast has 19 records
Clean yellow sand has 1 records
Clean yellow sandy has 1 records
Black loam has 2 records
Ballast has 19 records
Clay has 18 records
Running sand has 2 records
Clean sandy ballast has 5 records
Grey Thames ballast has 5 records
Grey sandy silt has 1 records
Hard blue sand has 1 records
Clean Thames ballast has 1 records
Blue clay has 1 records
Clean red sandy ballast has 3 records
Loamy ballast has 1 records
Clean grey ballast has 1 records
Brown loam has 7 records
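The per-lithology counting loop above can also be done in a single pandas call. A small sketch on invented rows standing in for geolDf:

```python
import pandas as pd

# invented rows standing in for geolDf
df = pd.DataFrame({'GEOL_DESC': ['Soil', 'Soil', 'Clay']})
#value_counts replaces the counting loop; the result is a Series
#indexed by description and sorted by frequency
counts = df['GEOL_DESC'].value_counts()
```

`counts['Soil']` then gives the number of records for that description directly.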
#create the lithology code column, defaulting to 7 ('other')
geolDf['litoCode'] = 7

#define lito dict for reclass
litoDict = {
    'soil':1,
    'dirty ballast':2,
    'sandy ballast':3,
    'london clay':4,
    'loam':5,
    'sand':6,
    'other':7
}
for index, row in geolDf.iterrows():
    for lito in litoDict:
        if lito in row["GEOL_DESC"].lower():
            geolDf.loc[index,'litoCode'] = litoDict[lito]
            break
geolDf.tail()
LOCA_ID GEOL_TOP GEOL_BASE GEOL_DESC litoCode
1098 575212 1.83 7.92 Clean yellow sandy ballast 3
1099 575212 7.92 9.14 Blue running sand 6
1100 575212 9.14 12.19 London Clay (very sandy) 4
1101 575213 0.00 0.30 Soil 1
1102 575213 0.30 4.88 Ballast 7
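The row-by-row reclassification can also be vectorized with `str.contains`. A sketch on invented descriptions: since the original loop breaks on the first matching key, the vectorized version assigns in reverse dictionary order so that earlier (higher priority) keys overwrite later ones.

```python
import pandas as pd

litoDict = {'soil':1, 'dirty ballast':2, 'sandy ballast':3,
            'london clay':4, 'loam':5, 'sand':6}

# invented descriptions standing in for geolDf
df = pd.DataFrame({'GEOL_DESC': ['Soil', 'Clean yellow sandy ballast',
                                 'London Clay (very sandy)', 'Ballast']})
df['litoCode'] = 7  # default: other
desc = df['GEOL_DESC'].str.lower()
#assign in reverse dict order so earlier (higher priority) keys overwrite
#later ones, mirroring the row loop's first-match break
for lito, code in reversed(litoDict.items()):
    df.loc[desc.str.contains(lito, regex=False), 'litoCode'] = code
```

'Clean yellow sandy ballast' matches both 'sand' and 'sandy ballast', but the reverse-order assignment leaves the higher-priority code 3, as in the row loop.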
#check the amount of reclassified records

#get unique values of lithology code
litoCodes = geolDf["litoCode"].unique()
print(litoCodes)

for lito in litoCodes:
    print('%s has %d records'%
          (lito,geolDf[geolDf["litoCode"]==lito].shape[0]))
[1 2 3 4 5 6 7]
1 has 224 records
2 has 219 records
3 has 247 records
4 has 250 records
5 has 74 records
6 has 44 records
7 has 45 records
#set LOCA_ID as Index
geolDf = geolDf.set_index('LOCA_ID')

geolDf.head()
GEOL_TOP GEOL_BASE GEOL_DESC litoCode
LOCA_ID
572547 0.00 1.22 Soil 1
572547 1.22 2.44 Dirty ballast 2
572547 2.44 8.84 Clean yellow sandy ballast 3
572547 8.84 11.89 London Clay (slightly sandy) 4
572548 0.00 0.30 Soil 1
#export lito to csv
geolDf.to_csv('../Csv/litoData.csv')

Working with borehole location

groupDict['LOCA']
[1167, 1436]
#create empty dataframe for borehole location
#borehole id, easting, northing, ground elevation
locaDf = pd.DataFrame(columns=['LOCA_ID', 'LOCA_NATE', 'LOCA_NATN', 'LOCA_GL'])
#counter for the dataframe index
i = 0
#interpret the numerical data as numbers
#the stored interval is inclusive, so add 1 to keep the group's last line
for line in agsData[groupDict['LOCA'][0]:groupDict['LOCA'][1]+1]:
    #get rid of the " 
    line = re.sub('"','',line)
    #split the line in columns
    lineItems = line.split(",")
    #working only with the "DATA" rows
    if lineItems[0] == 'DATA':
        #interpret numbers as float or int
        for index,item in enumerate(lineItems):
            try:
                if "." in item:
                    lineItems[index]=float(item)
                else:
                    lineItems[index]=int(item)
            except ValueError:
                pass
        #add first four columns to dataframe
        #print([lineItems[1], lineItems[4], lineItems[5], lineItems[7]])
        locaDf.loc[i] = [int(lineItems[1]), lineItems[4], lineItems[5], lineItems[7]]
        #add counter
        i += 1
#change id to integer
locaDf['LOCA_ID'] = pd.to_numeric(locaDf['LOCA_ID'], downcast="integer")

#check datatypes
locaDf.dtypes
LOCA_ID        int32
LOCA_NATE    float64
LOCA_NATN    float64
LOCA_GL      float64
dtype: object
#set ID as index
locaDf = locaDf.set_index('LOCA_ID')
locaDf.head()
LOCA_NATE LOCA_NATN LOCA_GL
LOCA_ID
572547 506930.0 169160.0 13.21
572548 506930.0 169250.0 12.91
572549 506950.0 169340.0 12.58
572550 506960.0 169430.0 12.25
572551 506970.0 169520.0 12.38
#create point shapefile

# define schema
schema = {
    'geometry':'Point',
    'properties':[('Id','int'),('Easting','float'),('Northing','float'),('Elevation','float')]
}

#open a fiona object
pointShp = fiona.open('../Shp/boreholePoints.shp', mode='w', driver='ESRI Shapefile',
          schema = schema, crs = "EPSG:27700")
#iterate over each row in the dataframe and save record
for index, row in locaDf.iterrows():
    rowDict = {
        'geometry' : {'type':'Point',
                     'coordinates': (row.LOCA_NATE,row.LOCA_NATN)},
        'properties': {'Id' : index,
                      'Easting':row.LOCA_NATE,
                      'Northing':row.LOCA_NATN,
                      'Elevation':row.LOCA_GL},
    }
    pointShp.write(rowDict)
#close fiona object
pointShp.close()
#export dataframe as csv
locaDf.to_csv('../Csv/locationData.csv')

Elevation calculation and VTK representation

import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#!pip install pyvista
import pyvista as pv

Import data

#import well location
wellLoc = pd.read_csv('../Csv/locationData.csv', index_col=0)
wellLoc.head()
LOCA_NATE LOCA_NATN LOCA_GL
LOCA_ID
572547 506930.0 169160.0 13.21
572548 506930.0 169250.0 12.91
572549 506950.0 169340.0 12.58
572550 506960.0 169430.0 12.25
572551 506970.0 169520.0 12.38
#import borehole lithology
wellLito = pd.read_csv('../Csv/litoData.csv')
wellLito.tail()
LOCA_ID GEOL_TOP GEOL_BASE GEOL_DESC litoCode
1098 575212 1.83 7.92 Clean yellow sandy ballast 3
1099 575212 7.92 9.14 Blue running sand 6
1100 575212 9.14 12.19 London Clay (very sandy) 4
1101 575213 0.00 0.30 Soil 1
1102 575213 0.30 4.88 Ballast 7

Process elevations

#create empty columns
wellLito['elevTop'] = -10000
wellLito['elevBot'] = -10000

for index, row in wellLito.iterrows():
    try:
        surfRow = wellLoc.loc[row.LOCA_ID]
        surfElev = surfRow.LOCA_GL
        wellLito.loc[index,'elevTop'] = surfElev - row.GEOL_TOP
        wellLito.loc[index,'elevBot'] = surfElev - row.GEOL_BASE
    except KeyError:
        wellLito = wellLito.drop(index)
#check well lito and export as csv
wellLito.head()
wellLito.to_csv('../Csv/litoDataElevations.csv')
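The iterrows/KeyError loop above can also be expressed as a merge. A sketch on invented miniature versions of wellLoc and wellLito: an inner merge keeps only boreholes that have a location, which plays the role of dropping rows in the KeyError branch.

```python
import pandas as pd

# invented miniature versions of wellLoc and wellLito
wellLoc = pd.DataFrame({'LOCA_GL': [13.21]},
                       index=pd.Index([572547], name='LOCA_ID'))
wellLito = pd.DataFrame({'LOCA_ID': [572547, 572547, 999999],
                         'GEOL_TOP': [0.00, 1.22, 0.00],
                         'GEOL_BASE': [1.22, 2.44, 1.00]})
#an inner merge drops boreholes without a location (here 999999)
merged = wellLito.merge(wellLoc, left_on='LOCA_ID', right_index=True, how='inner')
#elevations are ground level minus depth
merged['elevTop'] = merged['LOCA_GL'] - merged['GEOL_TOP']
merged['elevBot'] = merged['LOCA_GL'] - merged['GEOL_BASE']
```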

VTK representation

#generation of list arrays for the vtk
linSec = []
linVerts = []

i = 0
for index, values in wellLito.iterrows():
    x, y = wellLoc.loc[values.LOCA_ID][['LOCA_NATE','LOCA_NATN']]
    #two vertices per interval: top and bottom of the layer
    cellVerts = [[x,y,values.elevTop],[x,y,values.elevBot]]
    #cell definition: [number of points, first point id, second point id]
    linSec = linSec + [2,2*i,2*i+1]
    linVerts = linVerts + cellVerts
    i += 1

linArray = np.array(linSec)
#cell type 3 corresponds to VTK_LINE
cellType = np.full(i, 3, dtype=np.uint8)
vertArray = np.array(linVerts)
# create the unstructured grid and assign lito code
grid = pv.UnstructuredGrid(linArray, cellType, vertArray)
grid.cell_data["values"] = wellLito.litoCode.values
grid.save('../Vtk/conceptualizedLito.vtu',binary=False)
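The connectivity construction above can be checked without PyVista: each line cell is encoded as [n_points, id0, id1] and cell type 3 is VTK_LINE. A numpy-only sketch with two invented intervals at one borehole location:

```python
import numpy as np

# two invented borehole intervals at one location: (x, y, top elev, bottom elev)
intervals = [(506930.0, 169160.0, 13.21, 11.99),
             (506930.0, 169160.0, 11.99, 10.77)]

cells, verts = [], []
for i, (x, y, top, bot) in enumerate(intervals):
    #VTK cell encoding: [number_of_points, first_point_id, second_point_id]
    cells += [2, 2*i, 2*i + 1]
    #two vertices per interval, sharing the same x, y
    verts += [[x, y, top], [x, y, bot]]

cellArray = np.array(cells)
vertArray = np.array(verts)
cellTypes = np.full(len(intervals), 3, dtype=np.uint8)  # 3 = VTK_LINE
```

For n intervals this yields a cell array of length 3n and a vertex array of shape (2n, 3), which is exactly what the unstructured grid constructor expects.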

Input data

You can download the input data from this link.


Saul Montoya

Saul Montoya is a Civil Engineer graduated from the Pontificia Universidad Católica del Perú in Lima, with postgraduate studies in Water Resources Management and Engineering (WAREM Program) at the University of Stuttgart, specializing in Groundwater Engineering and Hydroinformatics.

 
