Monday, March 06, 2023

"In the Cloud, Performance is Instrumented as Cost"

About 5 years ago, I was at a conference where someone put this statement up on a PowerPoint slide. (I would like to credit the author correctly, but I can't remember who it was.) We all looked at it, thought about it, and said 'yes, of course' to ourselves. However, as a consultant who specialises in performance optimisation, it is only recently that I have started to have conversations with clients that reflect that idea.

In the good old/bad old days of 'on premises'

It is not that long ago that the only option for procuring new hardware was to go through a sizing exercise: guess how much you needed, allow for future growth in data and processing volumes, decide how much you were actually willing to pay, purchase it, and finally wheel it into your data centre and hope for the best.

It was then normal to want to get the best possible performance out of whatever system was installed on that hardware.  It would inevitably slow down over time.  Eventually, after the hardware purchase had been fully depreciated, you would have to start the whole cycle again and replace the hardware with newer hardware.

Oracle licensing worked similarly. You had to license Oracle for all your CPUs (with a few exceptions where you could associate specific CPUs with specific VMs and license Oracle only for the CPUs in those VMs). You also had to decide which Oracle features to license. Standard or Enterprise Edition? Diagnostics? Tuning? RAC? Partitioning? Compression? In-Memory?

"You're gonna need a bigger boat"

Then, when you encountered performance problems, you did the best you could with what you had. As a consultant, there was rarely any point in telling a customer that they had run out of resources and needed more. The answer was usually along the lines of 'we have spent our money on that, it has to last for five years, we have no additional budget, and it has to work'. So you got on with finding the rabbit in the hat.

In the cloud, instead of purchasing hardware as a capital expense, you rent hardware as an operational expense.

You can bring your own Oracle licence (BYOL), in which case you have exactly what you were previously licensed for. "At a high level, one Oracle Processor License maps to two OCPUs."

With Oracle's cloud licensing there are still lots of choices to make, not just how many CPUs and how much memory. You can choose Infrastructure as a Service (IaaS), where you rent the server and install and license Oracle on it just as you did on-premises. You can choose different storage systems with different I/O profiles. There are different levels of PaaS with different database features. You can go all the way up to Extreme Performance on Exadata. All of these choices have a cost consequence. Oracle provides a cloud cost estimator tool (other consultancies have produced their own versions). These tools make the link between these choices and their costs very clear.

You can have as much performance as you are willing to pay for

I have been working with a customer who is moving a PeopleSoft system from SuperCluster on-premises to Exadata Cloud-at-Customer (so it is physically on-site, but in all other respects it is in the cloud). They are not bringing their own licence (BYOL); instead, they are on a tariff of US$1.3441/OCPU/hour. We have found it easier to talk about roughly US$1,000/OCPU/month.
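That conversion is simple to check. A minimal sketch (assuming a 31-day month for round numbers):

```python
# Convert the hourly OCPU tariff to an approximate monthly figure.
hourly_rate_usd = 1.3441        # US$/OCPU/hour, from the customer's tariff
hours_per_month = 24 * 31       # assume a 31-day month

monthly_rate_usd = hourly_rate_usd * hours_per_month
print(round(monthly_rate_usd, 2))  # just over US$1,000/OCPU/month
```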

Just as they would with an on-premises system, they went through a sizing exercise, which predicted they would need 6 OCPUs on each of 2 RAC nodes during the day, and 10 at night.

It has been very helpful to have a clear quantitative definition of acceptable performance for the critical part of the system, the overnight reporting batch.  "The reports need to be available to users by the start of the working day in continental Europe, at 8am CET", which is 2am EST.  There is no benefit in providing additional resources to allow the batch to finish any earlier.  Instead, we only need to provide as much as is necessary to reliably meet the target.
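That deadline arithmetic can be verified with the standard library's zoneinfo (Paris and New York stand in here for CET and EST; early March 2023 is before either region changes its clocks, so both are on standard time):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# 08:00 CET on a date when both regions are still on standard time
deadline_cet = datetime(2023, 3, 6, 8, 0, tzinfo=ZoneInfo("Europe/Paris"))
deadline_local = deadline_cet.astimezone(ZoneInfo("America/New_York"))
print(deadline_local.strftime("%H:%M %Z"))  # 02:00 EST
```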

A performance tuning and testing exercise quickly showed that fewer CPUs were needed than predicted: 2-4 OCPUs/node during the day is looking comfortable. The new Exadata has fewer but much faster CPUs, and as we adjusted the application configuration to match, we found we were able to reduce the number of OCPUs further.

If we hadn't already been using the base-level In-Memory feature on SuperCluster, then to complete the overnight batch in time for the start of the European working day we would probably have needed 10 OCPUs/node. The base-level In-Memory option brought that down to around 7. This shows the huge value of careful use of database features and techniques to reduce CPU overhead.

We are not using BYOL, so we can use the fully featured In-Memory option with a larger store. Increasing the In-Memory store from 16GB to 40GB per node saved another OCPU, but cost nothing. Had we been using BYOL, we would have had to pay extra for the fully featured In-Memory option, and I doubt the marginal benefit would have justified the cost.

The customer has been considering switching on the extra OCPUs overnight to facilitate the batch. Doing so costs US$1.3441 per OCPU per hour, and at the end of the month they get an invoice from Oracle. That has concentrated minds and changed behaviours. The customer understands that there is a real dollar cost or saving attached to their business decisions.
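As a rough illustration of what concentrates those minds (the OCPU counts come from the sizing exercise above; the 8-hour batch window is my assumption, not a figure from the customer):

```python
# Illustrative cost of the extra overnight OCPUs across both RAC nodes.
rate_usd_per_ocpu_hour = 1.3441
nodes = 2
day_ocpus_per_node, night_ocpus_per_node = 6, 10
batch_window_hours = 8           # assumed length of the overnight window

extra_ocpus = (night_ocpus_per_node - day_ocpus_per_node) * nodes
nightly_cost = extra_ocpus * batch_window_hours * rate_usd_per_ocpu_hour
monthly_cost = nightly_cost * 31

print(round(nightly_cost, 2), round(monthly_cost, 2))
```

Under those assumptions the extra capacity costs on the order of US$86 a night, or roughly US$2,700 a month — small numbers, but ones that now appear on an invoice.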

One day I was asked: "What happens if we reduce the number of CPUs from 6 to 4?"

Essentially, the batch will take longer. We are already using the Database Resource Manager to prioritise processes when all the CPU is in use. The resource manager plan has been built to reflect the business priorities, and so keeps things fair for all users. For example, it ensures that users of the online part of the application get CPU in preference to batch processes; this is important for users in Asia, who are online when the batch runs overnight in North America. We also use the resource plan to impose different parallel query limits on different groups of processes. If we are going to vary the number of CPUs, we will have to switch between different resource manager plans with different limits. We will also have to reduce the number of reports that the application can execute concurrently, so some application configuration has to go hand in hand with the database configuration.

Effective caching by the database meant we already did relatively little physical I/O during the reporting; most of the time was already spent on CPU. Use of In-Memory further reduced physical I/O, so that now nearly all the time is spent on CPU, and it also reduced the overall CPU consumption and therefore the response time.

When we did vary the number of CPUs, we were not surprised to observe, from the Active Session History (ASH), that the total amount of database time spent on CPU by the nVision reporting processes is roughly constant (indicated by the blue area in the charts below). If we reduce the number of concurrent processes, the batch simply runs for longer.
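That ASH observation suggests a simple first-order model: the total CPU work is fixed, so elapsed time scales inversely with the OCPUs available to the batch. A sketch (the 80 CPU-hours total is purely illustrative, not a measured figure):

```python
# If total DB CPU time is roughly constant, elapsed time ~ work / capacity.
TOTAL_CPU_HOURS = 80.0   # illustrative total CPU time for the reporting batch

def batch_elapsed_hours(ocpus_per_node: int, nodes: int = 2) -> float:
    """Estimated elapsed hours if the batch keeps every OCPU busy."""
    return TOTAL_CPU_HOURS / (ocpus_per_node * nodes)

for n in (4, 6, 10):
    print(n, round(batch_elapsed_hours(n), 1))  # fewer OCPUs, longer batch
```

In practice the scaling is never perfectly linear, but the model captures the trade-off the customer is pricing: buy more OCPU-hours and finish earlier, or buy fewer and finish later.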

There is no question that effective design and tuning are as important as they ever were. The laws of physics are the same in the cloud as in your own data centre. We worked hard to get the reporting to this level of performance and CPU usage.
The difference is that now you can measure exactly how much that effort is saving you on your cloud subscription, and you can choose to spend more or less on it to achieve your business objectives.

Determining the benefit to the business, in terms of the quantity and cost of users' time, remains as difficult as ever.  However, it was not a major consideration in this example because this all happens before the users are at work.