"There should not have been any problem. The hardware and software provided plenty of capacity, with a cluster of four servers, each an HP VL360 G5 with 2 quad-core processors. Three of them were running SharePoint® Server on 16G of RAM, and the fourth ran the database server on 32G. The servers sat behind a Cisco CSS load balancer on a 45Mbps DS3 line. The database storage was an EMC Clarion CX-500 SAN and the web servers used only local disk storage. All servers were running Windows Server 2003 64-bit Enterprise SP2. The web servers ran Microsoft Office SharePoint® 2007 64-bit, and the SQL server ran SQL Server 2005, roll-up 8."
How was it resolved???
Revisiting SQL Server and the rebooting fix
We now found ourselves wondering whether the SharePoint® Server or the SQL Server was the culprit. We recalled that in earlier testing, rebooting the database server had fixed the problem, and we brought this to the attention of the Microsoft Support engineers.
We also found that if we stopped the load test while the servers were in a degraded state and restarted it within a few minutes, the degradation would continue, even at very low load levels. Further diagnostics around these symptoms revealed that once system performance had degraded significantly, clearing the query plan cache in SQL Server (via DBCC FREEPROCCACHE) would restore performance almost immediately. Unfortunately, the fix was not permanent, and performance degraded again within a short period of time.
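The diagnostic described above can be sketched as a small before/after timing probe. This is only an illustration, not the procedure the test team actually used: it assumes a Python DB-API cursor (for example one obtained from a driver such as pyodbc), and the function names and the probe query are hypothetical. Only the DBCC FREEPROCCACHE command itself comes from the text.

```python
import time


def time_query(cursor, sql):
    """Run one statement, drain its rows, and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    cursor.execute(sql)
    cursor.fetchall()  # include row transfer in the measurement
    return time.perf_counter() - start


def probe_plan_cache_effect(cursor, probe_sql):
    """Time a probe query before and after clearing the SQL Server plan cache.

    Caution: DBCC FREEPROCCACHE discards every cached plan on the instance,
    so all queries are recompiled afterwards. Run it only on a test system.
    """
    before = time_query(cursor, probe_sql)
    cursor.execute("DBCC FREEPROCCACHE")  # the command named in the text
    after = time_query(cursor, probe_sql)
    return before, after
```

A single probe is noisy, and the first run after the cache is cleared also pays a recompilation cost, so in practice one would repeat the probe several times on each side and compare the distributions rather than two individual numbers.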
Single-threaded cache access in a multi-processor system
These discoveries led the Microsoft engineers to a Microsoft Knowledge Base article (#927396) that described problems with the size of the TokenAndPermUserStore cache in SQL Server. When the server has a large amount of physical memory (in this case 32 GB) and the rate of random dynamic queries is high, the number of entries in this cache grows rapidly. As the cache grows, the time required to traverse and clean up the cache can be substantial. Because access to this cache is single-threaded, queries can pile up behind each other waiting for the cleanup to complete. This queuing slows performance and prevents a multi-processor system from scaling as expected. The remedy was to start SQL Server with a “-T4618” parameter, which limits the TokenAndPermUserStore cache size. (This was not one of the solutions listed in the Microsoft Knowledge Base for this issue; it was provided by a Microsoft Support engineer.)
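To see why a single-threaded cache can cap a multi-processor server, a deliberately simplified bottleneck model may help. It is not SQL Server's actual algorithm: it assumes every query does some parallelizable work and then holds one global lock for a time proportional to the number of cache entries. All parameter names and numbers (work_ms, scan_ns_per_entry, the entry counts) are hypothetical and chosen only for illustration.

```python
def max_throughput(cpus, work_ms, entries, scan_ns_per_entry):
    """Upper bound on queries/sec for the toy model described above.

    cpus              -- number of processors available for parallel work
    work_ms           -- per-query work that can run in parallel (hypothetical)
    entries           -- number of entries in the single-threaded cache
    scan_ns_per_entry -- time under the lock per entry (hypothetical)
    """
    lock_ms = entries * scan_ns_per_entry / 1_000_000
    # Bound 1: all CPUs busy, each query costs work_ms + lock_ms in total.
    parallel_bound = cpus * 1000.0 / (work_ms + lock_ms)
    # Bound 2: only one query at a time can hold the lock.
    serial_bound = 1000.0 / lock_ms if lock_ms > 0 else float("inf")
    return min(parallel_bound, serial_bound)


if __name__ == "__main__":
    for entries in (1_024, 100_000, 1_000_000):
        for cpus in (1, 8):
            rate = max_throughput(cpus, 5.0, entries, 100)
            print(f"entries={entries:>9} cpus={cpus}: {rate:8.1f} queries/sec")
```

With a small cache the eight-processor case is limited by CPU and scales almost linearly. With a large cache the serialized section dominates, and eight processors deliver barely more than one, which matches the failure to scale described above.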
Security Token Cache Size bug in SharePoint®
After the cache-limit fix was applied to SQL Server, the next load test of the system showed steady performance with 15 pages/sec and APDs under 1 second, supporting 650 concurrent users for 10 hours. However, in a subsequent load test, errors reading “Arithmetic operation resulted in an overflow.” started appearing in the pages, indicating that SharePoint® was unable to render many web parts on the page. Microsoft quickly traced this to a bug in a SharePoint® cache implementation, which was worked around by reducing the SharePoint® Security Token Cache size. Apparently the object cache throws integer overflow exceptions when the cache size is greater than 2000.
With the above fix applied and tested, the system was ready for a longer stress test to judge its stability over extended periods. The next load test ran for 48 hours at 650 users. The system performed well, easily satisfying the performance requirement with only a single SharePoint® web server. No degradation of performance was observed. Further testing with all three SharePoint® servers and higher load levels showed similar success.
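The load figures reported above (15 pages/sec with 650 concurrent users) can be sanity-checked with Little's Law. The sketch below assumes a closed, steady-state test in which each virtual user requests one page at a time; the case study does not state the think time, so the value computed here is only what those two figures imply. The function names are illustrative.

```python
def cycle_time_seconds(concurrent_users, pages_per_second):
    """Little's Law for a closed system: N = X * (R + Z), so R + Z = N / X."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return concurrent_users / pages_per_second


def implied_think_time_seconds(concurrent_users, pages_per_second, response_seconds):
    """Think time Z left over after subtracting the response time R from the cycle."""
    return cycle_time_seconds(concurrent_users, pages_per_second) - response_seconds


def total_pages(pages_per_second, hours):
    """Pages served at a steady rate over the given number of hours."""
    return pages_per_second * hours * 3600


if __name__ == "__main__":
    print(cycle_time_seconds(650, 15))               # seconds per user per page cycle
    print(implied_think_time_seconds(650, 15, 1.0))  # think time if pages take 1 s
    print(total_pages(15, 10))                       # pages in the 10-hour run
```

Under these assumptions each user completes a page cycle roughly every 43 seconds, and the 10-hour run at 15 pages/sec corresponds to 540,000 page views.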