Oracle Database Redo Logs Story

From time to time, I am called to check DB performance for various clients, and quite often, I observe a similar pattern:

- The DB can’t handle higher traffic.
- There is currently no DBA, and maybe there never was.
- The issue has existed from the start; it used to be bearable, but lately, as the business grew, it got worse.
- Or it appeared after a migration to the cloud that wasn't done by a DBA (because you don't need a DBA in the cloud /irony off).

What would you do first, especially when you don't have time to analyze an AWR report and PROD is down?

In such situations, my initial action is always to check:
1. OS and swap usage.
2. The alert log for obvious errors.
3. Redo log switches per hour (quick checks for points 2 and 3 are sketched just below).
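
For the alert log and the redo logs, a first look can come straight from the data dictionary. A minimal sketch, assuming 11g or later so that v$diag_info is available; the switch-per-hour query itself follows further below:

```sql
-- Directory of the text alert log (alert_<SID>.log) on 11g and later.
SELECT value AS alert_log_dir
FROM   v$diag_info
WHERE  name = 'Diag Trace';

-- Current redo log groups and their sizes; undersized logs show up here immediately.
SELECT group#, thread#, bytes / 1024 / 1024 AS size_mb, members, status
FROM   v$log
ORDER  BY thread#, group#;
```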

If all of the above looks fine, I generate an AWR report from peak time and continue with a proper investigation. But the reality is that redo logs are often heavily undersized and causing issues.

When the App <-> DB traffic rises, redo log switches become problematic. Of course, bad application design (like committing each row separately) is very common as well, but application design can't always be fixed quickly. So I tend to make the redo logs bigger to allow for some traffic; it's low-hanging fruit. With undersized redo logs, AWR is not very useful, because most of the time is spent switching logs and waiting on them. AWR becomes much more valuable for further investigation once normal traffic between the application and the database is possible again.
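Making the logs bigger is not an in-place resize: you add new, larger groups and drop the old ones once they are no longer in use. A sketch of the procedure, where the group numbers, the 4G size, and the file paths are assumptions to adapt to your environment (the target size follows from the switch-rate query in the next paragraph):

```sql
-- Add new, larger redo log groups first (numbers, size and paths are examples).
ALTER DATABASE ADD LOGFILE GROUP 11 ('/u01/oradata/PROD/redo11a.log') SIZE 4G;
ALTER DATABASE ADD LOGFILE GROUP 12 ('/u01/oradata/PROD/redo12a.log') SIZE 4G;
ALTER DATABASE ADD LOGFILE GROUP 13 ('/u01/oradata/PROD/redo13a.log') SIZE 4G;

-- Move the current write position off the old groups.
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM CHECKPOINT;

-- Drop an old group only once v$log reports it as INACTIVE;
-- repeat for each remaining old group, then remove its files at the OS level.
ALTER DATABASE DROP LOGFILE GROUP 1;
```

With OMF or ASM, the file name can usually be omitted and only the SIZE clause given.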

For the redo log checks, I use a select statement, a kind of heat map, to see when the peaks occur and how many log switches they caused. I aim for 4-6 log switches per hour at peak time and size the logs accordingly. Only after that do I start a proper investigation based on AWR, ADDM, query tuning, etc.
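The exact statement varies, but a minimal version of such a query over v$log_history could look like this (the 7-day window is an arbitrary choice; the classic heat-map variant pivots the 24 hours into columns):

```sql
-- Redo log switches per hour over the last 7 days; peaks stand out immediately.
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS day_hour,
       COUNT(*)                               AS log_switches
FROM   v$log_history
WHERE  first_time > SYSDATE - 7
GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY day_hour;
```

Multiplying the peak switch count by the current log size from v$log gives a rough redo volume per hour; dividing that by the target of 4-6 switches gives a starting point for the new log size.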

Success story:
Once, I fixed an unusable production database with this approach. It would constantly restart or hang every Monday when users logged in en masse. There were around 400-500 log switches per hour at the peak. You can imagine how upset that company was. And you can't imagine how happy everyone was after such an easy change to the redo log sizing. The DB became responsive, and we could move on to application redesign and other performance tuning steps, but that's for the long run.

When PROD is in trouble, you need immediate action to handle the pressure. The beauty is often in simple and ordinary things.

Do you use any other easy tricks to find low-hanging fruit?