Troubleshoot SmartView Performance Issues


Symptoms

Typically, the problem will arise only with large spreadsheets containing large numbers of SmartView cells. It may only occur when Refresh All is executed but not when worksheets are refreshed individually, or it may only occur at times of high load such as month end. It is more likely to arise with SmartView functions than Adhoc, because Adhocs are more efficient than functions. Such symptoms as these would point to a performance issue rather than a bug. However, if IIS logs show Deadlock detected, this is probably sign of a bug, where SmartView crashes IIS on the web server. Symptoms here would be SmartView becoming unresponsive and new users being unable to log on for a while. Restarting the web server resolves. This situation is outside the scope of this document.

Support

Once an actual bug has been eliminated, the issue must be treated as a performance problem. This lies outside the remit of your contract with Oracle Product Support. You may decide to contact a consultant or to try to tune the system yourself. Please understand that Support can only give you general and generic guidance in this. We will not read logs or make recommendations.

Background

SmartView communicates over HTTP using .NET. This just means that requests and responses are sent in a standard XML format that is tucked in the HTTP headers. The mechanism is the same as when your internet browser requests a file or submits a form (simplification). A standard Microsoft component that is part of Internet Explorer is used for this.

There are three components (four if you include the network) in a SmartView refresh:

  • the client
  • the web server
  • the HFM application server

It is easiest to troubleshoot when they are on separate machines. Running SmartView on the server to see whether it works better is a useful test to see what effect the network is having, but doesn’t show any other bottlenecks. If you run Task Manager on these 3 machines showing the Performance tab, you will see initial activity on the client as Excel clears the cells and the request is converted into XML. Then you see the web server gets busy as it receives and parses the request. Then the application server is busy as it sends the data, then the web server is busy as it compiles the results into XML again, and finally the client parses the XML, pushes the data into cells and runs an Excel calculation. This last stage is the longest of all. Overall, it is typical to see the client doing 90-95% of the work, so client performance is a very important factor. Look out for Excel’s memory usage increasing and RAM becoming in short supply. Often, user dissatisfaction with refresh times can be alleviated by installing more RAM or providing a faster workstation.

The situation we are dealing with here, however, is when the chain of events described above is interrupted. In this case, you will see the client or the web server wait in vain for a response from the next machine in the chain. The case of invalid XML is similar, but the response is truncated so that the XML is not well-formed when it arrives. XML consists of matching pairs of tags and is not well-formed when a closing tag is missing. This can be caused by certain network settings such as packet size (see later). The end result in either case is the same: as Excel cleared the cells before sending the request, cells are left blank or with zeroes.

The task now is to determine which particular component is failing.

Client

Check the HTTP timeout in the Windows registry. Example:

[HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings]\
“ReceiveTimeout”=dword:00dbba00
“KeepAliveTimeout”=dword:002BF20
“ServerInfoTimeout”=dword:002BF20

The default time-out limit is five minutes for versions 4.0 and 4.01 and is 60 minutes for versions 5.x and 6. Also, the default time-out limit is 30 seconds for Windows Internet Explorer 7 and 8. To change the settings, add the above keys. The above values are shown in hexadecimal, not decimal and correspond to 180,000 milliseconds (3 minutes) and 240 minutes for the receive timeout.

For more information about client timeout settings see following articles:
How to change the default keep-alive time-out value in Internet Explorer – http://support.microsoft.com/kb/813827
Internet Explorer error “connection timed out” when server does not respond – http://support.microsoft.com/kb/181050

You must determine an adequate timeout limit for your environment and largest workbook by trial and error or experience. Once these changes are applied, my recommendation is to reboot the client machine.

Web server

Look for errors in logs showing that the web app didn’t respond/disconnected or timed out waiting for another application (i.e. the HFM application server). Eliminate load balancers etc. during testing in order to isolate the problem and to ensure that the user always goes to the same web server. Otherwise, logs can be hard to interpret. See also the IIS Tuning Guide for IIS settings. We cannot recommend particular settings, as each client must establish the optimal settings for their own environment after performance testing. If necessary, engage the services of a suitably qualified consultant.

Application Server

HFM application server performance is very dependent on whether the data is already in RAM or not. If  the first refresh fails but the second one succeeds, this is probably a sign that the memory cache needed to be populated, which had been done during the first refresh. HFM loads whole subcubes (Scenario, Year, Value and Entity) into cache, so requesting large number of accounts in the same subcube quickly returns results, but if large number of entities is requested the results are returned slowly as the server runs out of available RAM and starts unloading subcubes to make room for more. The HFM event log (HsvEventLog.log) shows this happening, but don’t forget that the messages you see are a normal consequence of the way HFM works and are not a problem unless you see a hugely elevated number of them during the period the refresh occurs. Another sign would be page file thrashing. If there are several servers in a cluster, use sticky sessions in HFM to make sure tests always go the same server, or hit one server directly rather than using the cluster.

Solutions here involve installing more RAM on the server and changing the worksheet design to restrict the number of subcubes accessed. A separate consolidation server, so that reads are not competing with intensive consolidations and calculations, would be a good thing.

Network

To find out if packets get fragmented, use
         ping -l size -f

To send a big packet with the ‘don’t fragment’  flag. E.g.
        ping -l 1024 -f oracle.com

If you see a reply like Reply from 141.146.8.66: bytes=1024 time=200ms TTL=245 then the size you specified is below the packet limit of the network. But if you use, for example, 2000 in place of 1024, and you get Packet needs to be fragmented but DF set, then you know you have exceeded the limit. Combine this investigation with a Fiddler trace (https://www.fiddler2.com/fiddler2/) to see what size data messages are being sent and received.

Troubleshooting Guidance

  1. During troubleshooting process, fix your environment so that it is predictable. Register a single HFM application server and connect to that. Use a single web server and don’t go through Apache or any other redirector. Make sure we know which machine to go to for logs. This would be a temporary measure and it doesn’t stop the other users using the full, load balanced system so long as it doesn’t involve your server.
  2. As far as possible keep other users from using it during testing, unless you are specifically testing for performance under load. We want to see what activity is due to SmartView and not random other jobs.
  3. Use PerfMon on all three machines: client, web server and application server.
  4. Clear Logs before testing and note the times of each test so the logs can be correlated.
  5. Log CPU activity, memory usage, network throughput, page faults, thread count etc. I’m no expert on this so get advice from someone who is.
  6. In addition to the performance logs, from the IIS Server get:
    a. HTTP.SYS Error Log – %windir%\System32\LogFiles\HTTPERR (Default location; configurable)
    b. IIS Website Log – %windir%\System32\LogFiles\W3SVC# (Default location; configurable)
    c. Event Log (both System and Application) in EVT or EVTX and TXT (tab separated) formats
  7. From HFM Application server get:
    a. HsvEventLog.log (in <HFM install>\Server working Folder)
    b. Event Log (both System and Application) in EVT and TXT (tab separated) formats
  8. Be aware of all timeout settings in your environment (i.e. IIS & the network) and time your tests. This can show whether it is an ASP timeout, script timeout, client side timeout or whatever. Note that the Web Session Timeout setting for SmartView in HFM’s Server and Web Configuration is a separate issue from IIS timeouts. Web Session Timeout controls the length of idle time before SmartView makes the user log in again. Default 20 mins.
  9. Run Fiddler traces to directly observe the XML and check for HTTP failures.

Leave a comment