We have presented a joint paper on real-time hydraulic simulation in the context of oil and gas well cementing at the ASME OMAE2019 38th International Conference on Ocean, Offshore & Arctic Engineering in Glasgow.

The paper "Real Time Cementing Hydraulics Simulations Bring Risk Down" (OMAE2019-95100) described the physical model and implementation of a real-time hydraulic simulator to support well cementing operations. It outlined the performance challenges encountered and the solutions implemented. A case study in which the simulator was used to detect and characterize losses in a real-life field case was presented to demonstrate its application in a production environment.

The conclusion of the paper was that the performance of the simulator allows it to be run during the cementing job execution, using the data acquired from the rig as it comes in. Comparing simulated and acquired pressures and flow rates can then identify significant issues, such as losses to formation, and characterize them. Making that identification during the job rather than in the post-job analysis allows the operators to put remediation measures in place much sooner and therefore more effectively. Moreover, with this additional information the job design can be adjusted on the fly, as it is being pumped, to compensate for the losses and to achieve successful cement placement.

S. Pelipenko, N. C. Flamant, and S. C. Impey. "Real Time Cementing Hydraulics Simulations Bring Risk Down", Proceedings of the ASME 2019 38th International Conference on Ocean, Offshore and Arctic Engineering, OMAE2019-95100. Presented at the OMAE2019 Conference in Glasgow, Scotland, UK, June 9 – 14, 2019.

The full paper is available from the ASME conference proceedings website.

While current graphics processing units (GPUs) can offer at least an order of magnitude more processing power than the central processing units (CPUs) traditionally used to perform calculations, accessing this performance is frequently far from trivial. This series of articles examines some of the issues that arise when trying to optimize the performance of code running on the GPU.

Kernel Occupancy

Figure 1: Intel Processor Die Map (source: Intel)

A modern CPU, as shown in Figure 1, consists of a small number of identical, complex general purpose cores. Each of these cores is designed to process a wide range of instructions and to execute a single thread of instructions at high speed. Much of the complexity of these cores arises because, to improve speed, the processing of each instruction is pipelined.

Figure 2: 16 Stage CPU Pipeline (source: Intel)

Each stage of the processing of an instruction is done by a separate section of the hardware. This allows many instructions to be processed at the same time rather than waiting for each one to finish before beginning the next, as shown in Figure 2 for a 16 stage pipeline. This approach does have drawbacks, however. Chief amongst these is that it may not be possible to start processing the next instruction, for example if it depends on the result of an instruction still being processed or on data still being loaded. Modern CPUs allow for this by looking ahead in the list of upcoming instructions to find one that can be started; this is known as out of order execution. The drawback is that the hardware needed to do this is quite complex and reduces the number of transistors that can be allocated to the job of actually processing the instructions. This is just one example of how the general purpose nature of the CPU detracts from its ability to do a single job, in this case processing arithmetical instructions.
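
The benefit of pipelining, and the cost of stalls, can be illustrated with some simple cycle counting. The sketch below is an idealised model and not a description of any particular CPU: it assumes every stage takes exactly one clock cycle, and the function names and the `stall_cycles` parameter are introduced here purely for illustration.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Cycles needed if each instruction must finish before the next starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int, stall_cycles: int = 0) -> int:
    """Cycles needed in an ideal pipeline with one cycle per stage.

    The first instruction takes n_stages cycles to complete; after that one
    instruction completes per cycle. stall_cycles is the total number of
    cycles in which no new instruction could be started (an assumed input).
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


if __name__ == "__main__":
    n, stages = 1000, 16
    print("unpipelined:", cycles_unpipelined(n, stages))
    print("pipelined, no stalls:", cycles_pipelined(n, stages))
    print("pipelined, 200 stall cycles:", cycles_pipelined(n, stages, 200))
```

Under these assumptions the pipelined count approaches one instruction per cycle as long as stalls are rare, which is why CPUs invest so much hardware in avoiding them.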

Figure 3: NVIDIA Maxwell Architecture Block Diagram (source: NVIDIA)

In contrast to a CPU, a GPU consists of a very large number of identical, simple, in-order processing elements. This structure is shown in Figure 3. These elements are simple, efficient and small precisely because, unlike a core in a CPU, they are designed only to support a limited set of primarily arithmetical operations. The remaining functions required of the GPU are performed by a much smaller number of special purpose elements, which are shared between the much larger number of processing elements. The processing elements are organised into groups, each of which shares a number of special purpose elements and other resources, forming what is commonly termed a “compute core”.

While the structure of a GPU is very different from that of a CPU, many of the challenges in extracting the maximum performance from it remain the same. One of the primary challenges, as alluded to above when discussing the structure of a CPU, is to ensure that all of the processing elements are kept busy; the measure of this is known as kernel occupancy. With its very different structure, the GPU uses a different strategy from the CPU to maximize it.

Workgroups and Work Items

To extract the maximum performance from the GPU, the structure of the problem being solved must mirror the structure of the GPU itself. It must consist of a large number of identical tasks, or work items, each of which can be processed by one of the processing elements.

It is the nature of the problem that allows the GPU to adopt a different approach to handling cases where a processing element cannot, for some reason, execute the next instruction in the thread of instructions it is working on. Instead of including complex logic to look ahead in the thread for an instruction it can execute, the GPU leverages the fact that there are many work items to be performed and simply switches to another work item whose next instruction can be executed.
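
The effect of switching between work items can be seen in a toy simulation. The sketch below is a deliberately simplified model, not a description of real GPU scheduling: it assumes a single processing element that issues at most one instruction per cycle, and that after issuing an instruction a work item is stalled for a fixed number of cycles (for example while waiting on memory). The function name and parameters are hypothetical.

```python
def utilisation(n_work_items: int, instructions_per_item: int, stall_cycles: int) -> float:
    """Fraction of cycles in which the processing element issues an instruction.

    Each cycle the element scans the work items in order and issues one
    instruction from the first item that is not stalled. If every unfinished
    item is stalled, the cycle is wasted.
    """
    ready_at = [0] * n_work_items                     # first cycle each item may issue
    remaining = [instructions_per_item] * n_work_items
    total = n_work_items * instructions_per_item
    issued = 0
    cycle = 0
    while issued < total:
        for item in range(n_work_items):
            if remaining[item] > 0 and ready_at[item] <= cycle:
                remaining[item] -= 1
                issued += 1
                # The item cannot issue again until its stall has elapsed.
                ready_at[item] = cycle + 1 + stall_cycles
                break
        cycle += 1
    return issued / cycle


if __name__ == "__main__":
    for items in (1, 2, 4, 8):
        print(items, "work items:", round(utilisation(items, 10, 3), 3))
```

In this model a single work item leaves the element idle most of the time, while having at least `stall_cycles + 1` work items available is enough to keep it issuing on every cycle.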

As mentioned above, a single processing element is not complete by itself but rather is part of a compute core. The processing elements of a compute core share resources and may switch between the work items assigned to that compute core, but not to work items assigned to other compute cores. The set of work items assigned to a single compute core is known as a work group.

One of the most significant factors in maximising kernel occupancy is therefore to ensure that the problem to be solved is split into the optimal number of work groups, each containing the optimal number of work items. This is a non-trivial task and the correct values will vary depending on the hardware. The aim is to ensure that there are sufficient work groups to keep all the compute cores occupied and sufficient work items in each work group to keep all the processing elements in each compute core occupied. Some of the principal factors to be considered when computing the size of the work groups are listed below:

  • The number of compute cores available on the device being used.
  • The number of processing elements per compute core on the device being used.
  • The number of registers available per compute core vs the number required by the program to be executed.
  • The amount of shared memory available to each compute core vs the amount required by the program to be executed.
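
The factors above can be combined into a rough occupancy estimate. The sketch below is a minimal illustration under stated assumptions, not a vendor formula: the `ComputeCore` limits are hypothetical example numbers rather than the specification of any real device, a "maximum resident work items" limit is assumed in addition to the factors listed, and real hardware allocates registers and shared memory in blocks, which this model ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeCore:
    """Per-compute-core limits. All values used here are hypothetical."""
    max_resident_items: int   # most work items one core can hold at once
    registers: int            # registers shared by all resident work items
    shared_memory: int        # bytes of shared memory on the core


# Example limits chosen for illustration only.
EXAMPLE_CORE = ComputeCore(max_resident_items=2048, registers=65536, shared_memory=65536)


def resident_groups(core: ComputeCore, group_size: int,
                    registers_per_item: int, shared_per_group: int) -> int:
    """Number of work groups that fit on one compute core at the same time."""
    limits = [
        core.max_resident_items // group_size,
        core.registers // (registers_per_item * group_size),
    ]
    if shared_per_group > 0:
        limits.append(core.shared_memory // shared_per_group)
    return min(limits)


def occupancy(core: ComputeCore, group_size: int,
              registers_per_item: int, shared_per_group: int) -> float:
    """Resident work items as a fraction of the core's maximum."""
    groups = resident_groups(core, group_size, registers_per_item, shared_per_group)
    return groups * group_size / core.max_resident_items


def groups_needed(n_work_items: int, group_size: int) -> int:
    """Work groups required to cover the whole problem (rounded up)."""
    return -(-n_work_items // group_size)


if __name__ == "__main__":
    print(occupancy(EXAMPLE_CORE, 256, 32, 8192))    # no resource is limiting
    print(occupancy(EXAMPLE_CORE, 256, 64, 8192))    # limited by registers
    print(occupancy(EXAMPLE_CORE, 256, 32, 24576))   # limited by shared memory
```

Comparing `groups_needed` against the number of compute cores multiplied by `resident_groups` gives a first indication of whether the problem is large enough to fill the device.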

General Purpose Graphical Processing Unit (GPGPU) computing offers a powerful and cost-effective way of dramatically improving the computational speed of computer simulations.

By parallelising the computations and utilizing the consumer graphics hardware now present in most modern computers, speed increases of one to two orders of magnitude can potentially be achieved when compared with the more common single-threaded CPU approach.

Achieving this is not without its challenges. We briefly outline what these are, summarise how we would go about employing the approach for a particular problem, and explain why we are uniquely suited to the task at hand.

Introduction to GPGPU

Recent years have seen increasing interest in the concept of GPGPU computing. This can be attributed to three main factors. The first is the increasing disparity between the floating point computing power provided by GPUs and CPUs as shown in the following diagrams:

Floating-Point Operations per Second and Memory Bandwidth for the CPU and GPU (source: NVIDIA CUDA C Programming Guide version 4.2)

The theoretical floating point performance and memory bandwidth of GPUs are many times higher than those of CPUs. Combined with the relatively low price of the hardware and its wide availability in modern computers, this makes GPGPU computing an attractive proposition.

The second factor is that along with the increase in power there has been a shift in the design of GPUs, away from hardwired devices designed to perform a few specific graphical operations and towards massively parallel general purpose computing devices. This added programmability has allowed the computing power such GPUs offer to be harnessed for tasks other than graphics.

The final factor is the development of higher level languages for programming these GPUs. Initially, coding for GPUs required the developer to write in an assembly language specific to the device in question. More recently it has become possible to develop in higher level languages, often derived from C, which significantly simplifies and accelerates the development process.

Programming tools 

There are no strict requirements for the code used as a starting point for the application of the GPGPU approach. The most common is perhaps C/C++ code as the language affords broad functionality and high performance but starting with existing code in Fortran, Java, Pascal or Basic is entirely possible. 

For the GPGPU code itself there are three main Application Programming Interfaces (APIs) currently available: NVIDIA CUDA, OpenCL and DirectCompute. Each API has its own advantages and disadvantages; our current framework utilises OpenCL. Because the APIs have a considerable number of similarities, it is possible to switch between them should the project requirements demand it.

Main challenges

There are a number of challenges associated with successfully adopting the GPGPU compute solution to any particular problem.

More so than for a conventional serial solution, realising the potential performance of the GPGPU approach requires an understanding of the interaction between the overall algorithm, the numerical methods used and low-level implementation details such as memory and computational task management.

Some of the main issues are:

  • The algorithm needs to be parallelized – not all algorithms lend themselves readily to parallelization without modification (e.g. space-directional or time-explicit schemes)
  • The number of parallel computations must be large, with preferably homogeneous run times. The GPGPU compute approach works best when the number of parallel computations is in the thousands with each taking approximately the same time; otherwise the benefits quickly diminish. This can sometimes be addressed by adjusting the numerical approach or by selective workgroup grouping.
  • To achieve significant performance increases, code must be written specifically to target the GPU architecture. Both computation tasks and memory allocation must be such as to mesh well with the hardware architecture in terms of size, grouping and cross-dependencies.
  • Whilst a subject area under active development and research, this is still a relatively new approach. As such, compilers, debuggers and other programming tools are not yet fully refined and automated, requiring a higher level of understanding of hardware and driver behaviour to derive significant performance increases.
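
The cost of uneven run times, and the benefit of grouping similar computations together, can be illustrated with a toy model. The sketch below assumes that a work group only completes when its slowest work item has finished, so faster work items in the same group leave their processing elements idle. This is a simplification, and the function names and example run times are hypothetical.

```python
from typing import List, Sequence


def group_efficiency(run_times: Sequence[float]) -> float:
    """Useful work divided by the time the group's processing elements are held.

    Assumes one processing element per work item and that all elements are
    held until the slowest work item finishes. Run times must be positive.
    """
    if not run_times:
        raise ValueError("a work group needs at least one work item")
    return sum(run_times) / (len(run_times) * max(run_times))


def split_into_groups(run_times: Sequence[float], group_size: int) -> List[List[float]]:
    """Split work items into consecutive work groups of at most group_size."""
    return [list(run_times[i:i + group_size])
            for i in range(0, len(run_times), group_size)]


def mean_efficiency(run_times: Sequence[float], group_size: int) -> float:
    """Average efficiency over all work groups."""
    groups = split_into_groups(run_times, group_size)
    return sum(group_efficiency(g) for g in groups) / len(groups)


if __name__ == "__main__":
    mixed = [1, 10, 1, 10, 1, 10, 1, 10]   # fast and slow items interleaved
    print("interleaved:", mean_efficiency(mixed, 4))
    print("sorted first:", mean_efficiency(sorted(mixed), 4))
```

In this example, sorting the work items so that each work group contains items of similar duration removes the idle time entirely; in practice the run times are rarely known so precisely in advance, but the same reasoning applies to estimates.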

What we can do

Each problem considered for the application of the GPGPU compute method is different and the solutions will vary considerably on a case-by-case basis. However, there are some common steps to a successful solution:

  • Analyse the algorithm. This is a necessary and most important first step when considering applying the GPGPU approach to an existing solution. Typical steps here are:
    • Identify parallelizable sections
    • Do performance timings to see whether parallelizing the code will lead to performance improvements
    • Look at the memory usage of the algorithm to see if it is suited to running on the GPU
  • Adjust the numerical approach to better suit GPU implementation if required. Not all algorithms and approaches are readily parallelizable or suitable to run on a GPU; this may be down to the nature of the numerical solution, memory management or a combination of causes. Often these obstacles can be overcome by adjusting the numerical methods used, or possibly the higher-level algorithm.
  • Implementation: Write the parallelized code, integrate with the rest of computations, validate against existing results.
  • Refinement and expansion. After the initial performance goals are achieved, the approach can be iterative: successively removing bottlenecks that become apparent after the initial improvement, further enhancing solution performance, and making deeper adjustments to the numerical approach and higher-level algorithm.
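
The performance timings from the analysis step can be turned into a rough upper bound on the achievable gain using Amdahl's law, a standard estimate that is not specific to our approach. The sketch below assumes the timings give the fraction of run time spent in the parallelizable sections and that a speed-up factor for those sections can be guessed; the function name and the example numbers are illustrative only.

```python
def amdahl_speedup(parallel_fraction: float, section_speedup: float) -> float:
    """Overall speed-up when only part of the run time is accelerated.

    parallel_fraction: share of the original run time (0 to 1) spent in the
        sections that can be moved to the GPU.
    section_speedup: assumed speed-up of those sections alone.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if section_speedup <= 0.0:
        raise ValueError("section_speedup must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / section_speedup)


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 0.99):
        print(fraction, round(amdahl_speedup(fraction, 50.0), 2))
```

The estimate makes clear why identifying the parallelizable sections comes first: if they account for only half of the run time, the overall speed-up cannot exceed a factor of two however fast the GPU code becomes.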

Why us?

We as a company are uniquely suited to the application of the GPGPU approach to solving both new and existing problems. Apart from our personal interest in the subject area, we have:

  • Extensive integrated background in physics, applied mathematics and computing combined with a proven history of commercial project delivery
  • Record of successful application of GPU compute methods to existing algorithms and deployment of commercialized software with GPU compute sections
  • Experience of integration of GPU compute with 3D visualization, with other simulations as part of a workflow and with Graphical User Interfaces as part of commercial packages


We are looking for a talented software developer with a strong background in C++ and experience of C#/.NET WinForms. A mathematical modelling or physics background, or experience with 3D rendering technologies, would be a significant advantage.

Oxford Numerics is a mathematical modelling, scientific consultancy and software development company. We are a focused team of professionals with an extensive skill set, strong academic background and an emphasis on delivering robust and accessible solutions.

We have over 12 years of trusted relationships with major multinational clients, and experience of commercial projects in several fields. Our solutions are in use around the world and our team has worked for clients in Europe, North America and the Middle East.

In the energy sector, we have developed engineering, design and simulation tools for modelling complex oil and gas well operations. In the financial sector, our experience includes the development of pricing and risk software for a wide range of fixed income products, including bonds, credit and interest rate derivatives, and other instruments.

In all areas of our work, we have developed sophisticated 2D and 3D visualization tools to simplify the analysis and presentation of highly complex data sets.

How can we help you?

If you have an idea you would like to test we can work with you to develop a rapid prototype and suggest the best and most cost-effective options for implementation. We have experience of the entire product life cycle and the expertise required to produce and deploy enterprise level solutions.

We have a flexible contract framework allowing you to bring us in as and when you need us - e.g. for rapid prototyping, supporting your in-house teams or to implement a complete product.

Why us?

  • We bridge the gap between modelling and commercial software development due to our extensive expertise in both areas. This experience allows us to understand and speak the language of our clients, whether in the energy, financial services or real estate sectors, while being able to draw on techniques and best practice from our physics, mathematics and software development background.
  • We have a track record of successfully implementing parallel programming techniques, which can increase computational speed by orders of magnitude. Read more about GPGPU computing.
  • We are experts in taking your existing mathematical models from academic literature, Excel spreadsheets, legacy code bases (e.g. Matlab, VB, FORTRAN) and implementing them into a robust, modern, extensible and future-proof software architecture.
  • We recognize that analysis is useless if the results cannot be understood. We therefore specialize in producing software that can interpret and visualize complex simulation results in a clear, user-friendly way that can inform and improve commercial decision making.

Contact us to discuss how we could help your project:

Phone: 020 7490 5786

Find us on map (407 Davina House)