We are currently growing our activities in the financial sector. Our previous experience in this field as well as our extensive knowledge of mathematical modelling techniques and cutting-edge programming technologies make us ideally suited to provide services to clients looking to develop bespoke software solutions such as:
- High-performance libraries for complex pricing and risk algorithms.
- Libraries for low-level but critical functionality such as date manipulation.
- Financial web applications including mobile app development.
- Rapid prototyping and proof-of-concept development work.
- End-to-end development of stand-alone or networked enterprise-level applications.
With this in mind we have recently developed a demo web application providing swap valuation services, which is shown below.
This project has been built using a modular programming approach, implementing libraries for low-level features including date manipulation, yield curve construction and market data access that may be re-used in future projects. In addition, the project has provided valuable experience with a number of web technologies including:
- Microsoft Azure cloud services
- SQL Server
- Server-side C# ASP.NET 4.5.2
- Web sockets (SignalR library)
While current graphics processing units (GPUs) can offer at least an order of magnitude more processing power than the central processing units (CPUs) traditionally used to perform calculations, accessing this performance is frequently far from trivial. This series of articles examines some of the issues that arise when trying to optimize the performance of code running on the GPU.
A modern CPU, as shown in Figure 1, consists of a small number of identical, complex general purpose cores. Each of these cores is designed to process a wide range of instructions and to execute a single thread of instructions at high speed. Much of the complexity of these cores arises because the processing of each instruction is pipelined to improve speed.
Each stage of the processing of an instruction is done by a separate section of the hardware. This allows many instructions to be processed at the same time rather than waiting for each one to finish before beginning the next, as shown in Figure 2 for a 16 stage pipeline. This approach does have drawbacks, however. Chief amongst these is that it may not be possible to start processing the next instruction, for example if it depends on the result of an instruction still being processed or on data still being loaded. Modern CPUs allow for this by looking ahead in the list of upcoming instructions to find one that can be started, which is known as out of order execution. The drawback is that the hardware to do this is quite complex and reduces the number of transistors that can be allocated to the job of actually processing the instructions. This is just one example of how the general purpose nature of the CPU detracts from its ability to do a single job, in this case processing instructions.
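The benefit of pipelining, and the cost of a stall, can be illustrated with a rough cycle count. The sketch below is a simplified model rather than a description of any particular CPU: it assumes every stage takes one cycle and one instruction can enter the pipeline per cycle, and the function names and the `stall_cycles` parameter are our own illustrative choices.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Cycles needed if each instruction must finish before the next starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages, stall_cycles=0):
    """Cycles needed with an ideal pipeline (assumed: one stage per cycle).

    The first instruction takes n_stages cycles to complete; each later one
    completes one cycle after its predecessor. stall_cycles is the assumed
    total number of cycles lost waiting on dependencies or data.
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


# 1000 instructions on a 16 stage pipeline
print(cycles_unpipelined(1000, 16))                  # 16000
print(cycles_pipelined(1000, 16))                    # 1015
print(cycles_pipelined(1000, 16, stall_cycles=200))  # 1215
```

Under these assumptions the ideal pipeline is almost 16 times faster, and every stalled cycle eats directly into that gain, which is why CPUs spend hardware on avoiding stalls.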
In contrast to a CPU, a GPU consists of a very large number of identical, simple, in-order processing elements. This structure is shown in Figure 3. These elements are simple, efficient and small precisely because, unlike a core in a CPU, they are designed to support only a limited set of primarily arithmetical operations. The remaining functions required of the GPU are performed by a much smaller number of special purpose elements, which are shared between the much larger number of processing elements. The processing elements are organised into groups, each of which shares a number of special purpose elements and other resources, forming what is commonly termed a “compute core”.
While the structure of a GPU is very different from that of a CPU, many of the challenges in extracting the maximum performance from it remain the same. One of the primary challenges, as alluded to above when discussing the structure of a CPU, is to ensure that all of the processing elements are kept busy; the measure of this is known as kernel occupancy. With its very different structure, the GPU uses a different strategy from the CPU to maximize this.
Workgroups and Work Items
To extract the maximum performance from the GPU, the structure of the problem being solved must mirror the structure of the GPU itself. It must consist of a large number of identical tasks, or work items, each of which can be processed by one of the processing elements.
It is this structure of the problem that allows the GPU to adopt a different approach when a processing element cannot, for some reason, execute the next instruction in the thread of instructions it is working on. Instead of including complex logic to look ahead in the thread for an instruction it can execute, the GPU leverages the fact that there are many work items to be performed and simply switches to another work item whose next instruction can be executed.
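A toy simulation can show why having many work items to switch between keeps a processing element busy. This is a deliberately simplified model, not a description of real GPU scheduling: we assume each work item issues one instruction per turn and then has to wait a fixed number of cycles (`stall_cycles`, a hypothetical parameter) before its next instruction is ready.

```python
def simulate_utilisation(n_work_items, stall_cycles, total_cycles=10_000):
    """Fraction of cycles in which a single processing element issues work.

    Assumed model: issuing takes one cycle, after which that work item is
    stalled for stall_cycles cycles. On each cycle the element picks the
    first work item that is ready, or idles if none is.
    """
    ready_at = [0] * n_work_items  # cycle at which each work item is next ready
    busy = 0
    for cycle in range(total_cycles):
        for item in range(n_work_items):
            if ready_at[item] <= cycle:
                ready_at[item] = cycle + 1 + stall_cycles
                busy += 1
                break  # only one instruction can be issued per cycle
    return busy / total_cycles


for n in (1, 2, 4, 8):
    print(n, simulate_utilisation(n, stall_cycles=3))
```

With a three cycle stall, one work item keeps the element busy a quarter of the time, two keep it busy half the time, and four or more keep it fully occupied in this model.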
As mentioned above, a single processing element is not complete by itself but is part of a compute core. The processing elements of a compute core share resources and may switch between the work items assigned to that compute core, but not to work items assigned to other compute cores. The set of work items assigned to a single compute core is known as a work group.
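The relationship between work items and work groups can be sketched as simple index arithmetic. The helper names below are hypothetical, and the sketch assumes a one-dimensional problem with a fixed work group size, so that a work item's global index equals its group index times the group size plus its index within the group.

```python
def number_of_work_groups(n_work_items, group_size):
    """Work groups needed to cover all work items (the last may be partly empty)."""
    return -(-n_work_items // group_size)  # ceiling division


def locate_work_item(global_id, group_size):
    """Return (group_id, local_id) for a work item's global index."""
    return divmod(global_id, group_size)


print(number_of_work_groups(1000, 256))  # 4
print(locate_work_item(300, 256))        # (1, 44)
```

Here 1000 work items with a group size of 256 need four work groups, and work item 300 is the 45th item (local index 44) of the second group.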
One of the most significant factors in maximising kernel occupancy is therefore to ensure that the problem to be solved is split into the optimal number of work groups, each containing the optimal number of work items. This is a non-trivial task and the correct values will vary depending on the hardware. The aim is to ensure that there are sufficient work groups to keep all the compute cores occupied and sufficient work items in each work group to keep all the processing elements in the compute cores occupied. Some of the principal factors to be considered when computing the size of the work groups are listed below:
- The number of compute cores available on the device being used.
- The number of processing elements per compute core on the device being used.
- The number of registers available per compute core vs the number required by the program to be executed.
- The amount of shared memory available to each compute core vs the amount required by the program to be executed.
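The way these factors interact can be sketched with a small calculation. This is an illustrative model only: the `Device` fields, the example figures and the extra `max_items_per_core` limit are assumptions made for the sketch and do not describe a specific GPU, and occupancy is approximated here as resident work items divided by the maximum a compute core can hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    compute_cores: int
    elements_per_core: int
    registers_per_core: int
    shared_mem_per_core: int  # bytes
    max_items_per_core: int   # assumed limit on resident work items per core


def resident_groups_per_core(dev, group_size, regs_per_item, shared_per_group):
    """Work groups that fit on one compute core, taking the tightest limit."""
    limits = [
        dev.max_items_per_core // group_size,
        dev.registers_per_core // (regs_per_item * group_size),
    ]
    if shared_per_group > 0:
        limits.append(dev.shared_mem_per_core // shared_per_group)
    return min(limits)


def occupancy(dev, group_size, regs_per_item, shared_per_group):
    """Resident work items as a fraction of the core's assumed maximum."""
    groups = resident_groups_per_core(dev, group_size, regs_per_item, shared_per_group)
    return groups * group_size / dev.max_items_per_core


def enough_groups(dev, n_work_items, group_size):
    """True if there is at least one work group for every compute core."""
    n_groups = -(-n_work_items // group_size)  # ceiling division
    return n_groups >= dev.compute_cores


# Hypothetical device, chosen only to make the arithmetic easy to follow.
dev = Device(compute_cores=16, elements_per_core=64,
             registers_per_core=65536, shared_mem_per_core=49152,
             max_items_per_core=2048)

print(resident_groups_per_core(dev, 256, 64, 16384))  # 3, limited by shared memory
print(occupancy(dev, 256, 64, 16384))                 # 0.375
print(occupancy(dev, 256, 32, 0))                     # 1.0
print(enough_groups(dev, 1000, 256))                  # False, only 4 groups for 16 cores
```

In this example the kernel's shared memory use, not the number of processing elements, is what caps occupancy, which is why register and memory requirements have to be considered alongside the raw core counts.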
General Purpose Graphics Processing Unit (GPGPU) computing offers a powerful and cost-effective way of dramatically improving the computational speed of computer simulations.
By parallelising the computations and utilizing the consumer graphics hardware now present in most modern computers, speed increases of one to two orders of magnitude can potentially be achieved when compared with the more common single-threaded CPU approach.
Achieving this is not without its challenges – we briefly outline what these are with a summary of how we would go about employing the approach for a particular problem and why we are uniquely suited to the task at hand.
Introduction to GPGPU
Recent years have seen increasing interest in the concept of GPGPU computing. This can be attributed to three main factors. The first is the increasing disparity between the floating point computing power provided by GPUs and CPUs as shown in the following diagrams:
The theoretical floating point performance and memory transfer speed of GPUs are many times higher than those of CPUs. Combined with the relatively low price of the hardware and its wide availability in modern computers, this makes GPGPU computing an attractive proposition.
The second factor is that along with the increase in power there has been a shift in the design of GPUs, away from hardwired devices designed to perform a few specific graphical operations and towards massively parallel general purpose computing devices. This added programmability has allowed the computing power such GPUs offer to be harnessed for tasks other than graphics.
The final factor is the development of higher level languages for programming these GPUs. Initially, coding GPUs required the developer to write in an assembly language specific to the device in question. More recently it has become possible to develop in higher level languages, often derived from C, significantly simplifying and accelerating the development process.
There are no strict requirements for the code used as a starting point for the application of the GPGPU approach. The most common is perhaps C/C++ code as the language affords broad functionality and high performance but starting with existing code in Fortran, Java, Pascal or Basic is entirely possible.
For the GPGPU code itself there are three main Application Programming Interfaces (APIs) currently available: NVIDIA CUDA, OpenCL and DirectCompute. Each API has its own advantages and disadvantages, and our current framework utilises OpenCL. Due to the considerable number of similarities between the APIs, it is possible to switch between them should the project requirements demand it.
There are a number of challenges associated with successfully adopting the GPGPU compute solution to any particular problem.
More so than for a conventional serial solution, realising the potential performance of the GPGPU approach requires an understanding of the interaction between the overall algorithm, the numerical methods used and low-level implementation details such as memory and computational task management.
Some of the main issues are:
- The algorithm needs to be parallelized – not all algorithms lend themselves readily to parallelization without modification (e.g. space-directional or time-explicit schemes)
- The number of parallel computations must be large with preferably homogeneous run times. The GPGPU compute approach works best when the number of parallel computations is in the 1000s with each taking approximately the same time – otherwise the benefits quickly diminish. This can sometimes be addressed by adjusting the numerical approach or by selective workgroup grouping.
- To achieve significant performance increases, code must be written specifically to target the GPU architecture. Both computation tasks and memory allocation must be such as to mesh well with the hardware architecture in terms of size, grouping and cross-dependencies.
- Whilst a subject area under active development and research, this is still a relatively new approach. As such, compilers, debuggers and other programming tools are not yet fully refined and automated, requiring a higher level of understanding of hardware and driver behaviour to derive significant performance increases.
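The effect of uneven run times, and of grouping similar computations together, can be illustrated with a simple estimate. The sketch assumes, as a simplification, that a work group occupies its processing elements until its slowest work item has finished; the function name and the run times are hypothetical.

```python
def group_efficiency(run_times, group_size):
    """Useful work divided by the time the hardware is held, under the
    assumption that each group lasts as long as its slowest work item."""
    if not run_times:
        raise ValueError("run_times must not be empty")
    useful = sum(run_times)
    occupied = 0
    for start in range(0, len(run_times), group_size):
        group = run_times[start:start + group_size]
        occupied += max(group) * len(group)
    return useful / occupied


mixed = [1, 5, 1, 5, 1, 5, 1, 5]             # short and long tasks interleaved
print(group_efficiency(mixed, 4))            # 0.6
print(group_efficiency(sorted(mixed), 4))    # 1.0
```

In this model, interleaving short and long tasks wastes 40% of the hardware time, while sorting them so that each work group holds tasks of similar length recovers it.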
What we can do
Each problem considered for the application of the GPGPU compute method is different and the solutions will vary considerably on a case-by-case basis. However, there are some common steps to a successful solution:
- Analyse the algorithm. This is the necessary and most important first step when considering applying the GPGPU approach to an existing solution. Typical steps here are:
  - Identify parallelizable sections
  - Do performance timings to see whether parallelizing the code will lead to performance improvements
  - Look at the memory usage of the algorithm to see if it is suited to running on the GPU
- Adjust the numerical approach to better suit GPU implementation if required. Not all algorithms and approaches are readily parallelizable or suitable to run on a GPU – this may be down to the nature of the numerical solution, memory management or a combination of causes. Often these obstacles can be overcome by adjusting the numerical methods used or, where necessary, the higher-level algorithm.
- Implementation: Write the parallelized code, integrate it with the rest of the computations and validate it against existing results.
- Refinement and expansion. After the initial performance goals are achieved, the approach can be iterative – successively removing the bottlenecks that become apparent after the initial improvement, further enhancing performance and making deeper adjustments to the numerical approach and higher-level algorithm.
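For the timing step of the analysis, one standard way to judge whether parallelizing a section is worthwhile is Amdahl's law, which bounds the overall speed-up by the fraction of run time that can be parallelized. The sketch below is illustrative: the function names, the timings and the assumed GPU speed-up of the parallel section are all hypothetical inputs, not measured results.

```python
def amdahl_speedup(parallel_fraction, section_speedup):
    """Overall speed-up when only part of the run time is accelerated."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if section_speedup <= 0:
        raise ValueError("section_speedup must be positive")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / section_speedup)


def estimate_from_timings(serial_seconds, parallel_seconds, section_speedup):
    """Estimate overall speed-up from measured timings of the serial-only
    part and the part identified as parallelizable."""
    fraction = parallel_seconds / (serial_seconds + parallel_seconds)
    return amdahl_speedup(fraction, section_speedup)


# 1 s of serial work, 9 s parallelizable, assumed 50x faster on the GPU
print(estimate_from_timings(1.0, 9.0, 50.0))  # about 8.5
```

Even with a 50 times faster parallel section, the remaining serial second limits the overall gain to under 10 times in this example, which is why the refinement step often turns to the parts of the algorithm that were initially left on the CPU.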
We as a company are uniquely suited to the application of the GPGPU approach to solving both new and existing problems. Apart from our personal interest in the subject area, we have:
- Extensive integrated background in physics, applied mathematics and computing combined with a proven history of commercial project delivery
- Record of successful application of GPU compute methods to existing algorithms and deployment of commercialized software with GPU compute sections
- Experience of integration of GPU compute with 3D visualization, with other simulations as part of a workflow and with Graphical User Interfaces as part of commercial packages
We are looking for a talented software developer with a strong background in C++ and experience of C#/.NET WinForms. A mathematical modelling or physics background, or experience with 3D rendering technologies, would be a significant advantage.
Oxford Numerics is a mathematical modelling, scientific consultancy and software development company. We are a focused team of professionals with an extensive skill set, strong academic background and an emphasis on delivering robust and accessible solutions.
We have over 12 years of trusted relationships with major multinational clients, and experience of commercial projects in several fields. Our solutions are in use around the world and our team has worked for clients in Europe, North America and the Middle East.
In the energy sector, we have developed engineering, design and simulation tools for modelling complex oil and gas well operations. In the financial sector, our experience includes the development of pricing and risk software for a wide range of fixed income products, including bonds, credit and interest rate derivatives, and other instruments.
In all areas of our work, we have developed sophisticated 2D and 3D visualization tools to simplify the analysis and presentation of highly complex data sets.
How can we help you?
If you have an idea you would like to test, we can work with you to develop a rapid prototype and suggest the best and most cost-effective options for implementation. We have experience of the entire product life cycle and the expertise required to produce and deploy enterprise-level solutions.
We have a flexible contract framework allowing you to bring us in as and when you need us - e.g. for rapid prototyping, supporting your in-house teams or to implement a complete product.
- We bridge the gap between modelling and commercial software development due to our extensive expertise in both areas. This experience allows us to understand and speak the language of our clients, whether in the energy, financial services or real estate sectors, while being able to draw on techniques and best practice from our physics, mathematics and software development background.
- We have a track record of successfully implementing parallel programming techniques, which can increase computational speed by orders of magnitude. Read more about GPGPU computing.
- We are experts in taking your existing mathematical models from academic literature, Excel spreadsheets or legacy code bases (e.g. Matlab, VB, FORTRAN) and implementing them in a robust, modern, extensible and future-proof software architecture.
- We recognize that analysis is useless if the results cannot be understood. We therefore specialize in producing software that can interpret and visualize complex simulation results in a clear, user-friendly way that can inform and improve commercial decision making.
Contact us to discuss how we could help your project:
Phone: 020 7490 5786