SCOM – Upgrade to 1801 on production

Hey

This post covers my experience upgrading a multi-server SCOM 2016 setup to SCOM 1801, and highlights some aspects of the upgrade that, in my own subjective opinion, are skipped or not given enough attention in the upgrade guide.

If you haven’t read the upgrade guide already, here is the link.

Read it. Then think if you understand everything, and then read it again 🙂

In case of Ooops….

I can’t stress enough how important it is to have VM checkpoints (aka snapshots) in place. Make sure that your storage has enough space to hold the differencing disks for long enough to complete the upgrade. I assume your OM environment is fully virtualized; if it’s not, make sure you have a reliable backup of your physical servers. My test environment actually didn’t survive the upgrade: the agent didn’t start up due to a cryptic “Class not registered” error. The MS support techie gave up on my case, saying my test server missed (and it did!) the minimum requirements, which are 4 cores and 8 GB of RAM. Shame 🙂

Patch it twice.

Once you get the VM snapshots done, my recommendation is to check Microsoft Update for patches to the SCOM installation software. If you don’t have that kind of access, try to bribe the network guys. Throw some routers at them or whatever works in your case ;). The thing is, MS frequently updates their installation software but not necessarily the ISO files it ships on. And there might be fairly important bugfixes that decide whether your deployment succeeds or not.

A little bit of scripting here and there

You need to stop all SCOM services on all other management servers. A short PS:

$Servers = (Get-SCOMResourcePool "All Management Servers Resource Pool" | % Members | % DisplayName).Where{$_ -notlike "$env:computername*"}

This gets the management servers other than the one you are running on.

Get-Service omsdk,healthservice,cshost -ComputerName $servers

This should return all services as stopped. If it does, you have met one of the important prereqs.
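The check above only reports the state, so here is a minimal sketch of actually stopping the services remotely. It assumes Windows PowerShell 5.1 (where Get-Service still has -ComputerName), that $Servers was built as shown earlier, and that your account may control services on those servers. I haven't run this exact snippet as part of the upgrade described here, so treat it as a starting point.

```powershell
# Sketch only: stop the SCOM services on every other management server.
# Assumes $Servers holds the server names collected earlier.
$ServiceNames = 'omsdk', 'healthservice', 'cshost'

foreach ($Server in $Servers) {
    # Fetch the three services from the remote machine, skip the ones
    # that are already stopped, and stop the rest.
    Get-Service -Name $ServiceNames -ComputerName $Server |
        Where-Object Status -ne 'Stopped' |
        Stop-Service -Force -PassThru |
        Select-Object MachineName, Name, Status
}
```

Re-run the Get-Service check afterwards to confirm everything really reports Stopped before you start setup.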

Also, you will fail to upgrade SCOM if you don’t know your service account passwords, so check your documentation or password managers.

The first management server takes a really long time to upgrade. Subsequent servers upgrade much faster, because the SCOM database and data warehouse are upgraded along with the first management server.

Telemetry data / Application Insights in web console

Another nifty thing, which I encourage you to make the most of, is the SCOM 1801 Web Console. It seems to be a little-known fact that MS has incorporated Application Insights into the new web console, so telemetry data is sent to MS systems to check performance and usage. See more details here. Application Insights, if you are unfamiliar with it, is a telemetry service which gives devs pretty robust information on how the application is used, its response times and so on. Unless you have a very specific requirement not to turn this on, I’d suggest forwarding this data (actually all telemetry data) to MS: it’s in their interest to make SCOM better, and in our interest to make our everyday monitoring tool a bit better.

Upgrade agents – *poof* audit data gone

Last time I upgraded SCOM agents I realized the hard way that agents lose their audit configuration data upon upgrade. So you need to set it again if you use ACS collectors. Now, with a little help from ConfigMgr, I made myself a quick compliance check to make sure all domain controllers are running a properly configured ACS agent.

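# Look up the ACS forwarder service; stay quiet if it is not installed
$AdtService = Get-Service AdtAgent -ErrorAction SilentlyContinue

# Compliant only when the service exists and is running
if ($AdtService.Status -eq "Running") {
    return "OK"
}
else {
    return "Not OK"
}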

And also a registry query:

HKLM:\SOFTWARE\Policies\Microsoft\AdtAgent\Parameters\AdtServers, with the following check in SCCM:

[Screenshot: ACS CI]
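If you would rather keep the registry part as a script too, here is a minimal sketch of a discovery script doing the same check. It assumes AdtServers is a value under the Parameters key holding the collector name(s); the function name Test-AcsCollectorConfigured is made up for this example and is not part of any SCOM or ConfigMgr module.

```powershell
# Sketch only: returns "OK" when the AdtServers value exists and is not empty.
# Test-AcsCollectorConfigured is a hypothetical helper name.
function Test-AcsCollectorConfigured {
    param(
        # Assumption: AdtServers lives as a value under this key
        [string]$KeyPath = 'HKLM:\SOFTWARE\Policies\Microsoft\AdtAgent\Parameters'
    )

    # A missing key yields $null instead of an error
    $Parameters = Get-ItemProperty -Path $KeyPath -ErrorAction SilentlyContinue

    if ($Parameters -and $Parameters.AdtServers) {
        'OK'
    }
    else {
        'Not OK'
    }
}

Test-AcsCollectorConfigured
```

In the configuration item the compliance rule then just expects the string OK, same as with the service check.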

These rules are collected into a compliance baseline, deployed to the Domain Controllers collection. Now I’m certain I’ll get alerted if I forget to set it up again. In case anyone needs an SCCM collection query for AD domain controllers, here it is:

select SMS_R_SYSTEM.ResourceID, SMS_R_SYSTEM.ResourceType, SMS_R_SYSTEM.Name,
       SMS_R_SYSTEM.SMSUniqueIdentifier, SMS_R_SYSTEM.ResourceDomainORWorkgroup,
       SMS_R_SYSTEM.Client
from SMS_R_System
inner join SMS_G_System_COMPUTER_SYSTEM
    on SMS_G_System_COMPUTER_SYSTEM.ResourceId = SMS_R_System.ResourceId
where SMS_R_System.PrimaryGroupID = "516"

Timeouts, all timeouts

The upgrade was successful, well, to some extent. The default configuration in the ConfigService.config file might contain timeout thresholds that are pretty low. All my management servers errored out on me, mostly with error 29181:

OpsMgr Management Configuration Service failed to execute 'LocalHealthServiceDirtyNotification' engine work item due to the following exception

Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessOperationTimeoutException

I dared to increase the timeouts so these operations could complete (I have no proper insight into the consequences of this one, so follow at your own risk), modifying one of the lines in the config file and increasing the timeout until the problem was gone.

Then I copied the config file and bounced the cshost services on the management servers. I really hope this didn’t set up a time bomb of some sort [please do reassure me or prove me wrong].
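Before touching anything in that file it helps to see which timeouts are currently set, and to compare them between management servers afterwards. Below is a minimal sketch that lists every element carrying a TimeoutSeconds or DefaultTimeoutSeconds attribute. The function name Get-ConfigServiceTimeout is made up for this example, and the path in the usage comment is a placeholder you need to fill in for your own install; the only thing assumed about the file is that it is XML with those attribute names.

```powershell
# Sketch only: list the timeout attributes found in a ConfigService.config document.
# Get-ConfigServiceTimeout is a hypothetical helper, not a SCOM cmdlet.
function Get-ConfigServiceTimeout {
    param(
        [Parameter(Mandatory)]
        [xml]$Config
    )

    # Any element with a TimeoutSeconds or DefaultTimeoutSeconds attribute
    $Nodes = $Config.SelectNodes('//*[@TimeoutSeconds or @DefaultTimeoutSeconds]')

    foreach ($Node in $Nodes) {
        # GetAttribute returns an empty string when the attribute is absent
        $Seconds = $Node.GetAttribute('TimeoutSeconds')
        if (-not $Seconds) {
            $Seconds = $Node.GetAttribute('DefaultTimeoutSeconds')
        }

        [pscustomobject]@{
            Element        = $Node.LocalName
            Name           = $Node.GetAttribute('Name')
            TimeoutSeconds = [int]$Seconds
        }
    }
}

# Usage (fill in the folder of your own management server install):
# $Config = [xml](Get-Content -Raw -Path '<install folder>\ConfigService.config')
# Get-ConfigServiceTimeout -Config $Config | Format-Table -AutoSize
```

Keeping the output from before and after the change makes it much easier to tell later whether an upgrade has reset the values.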

That’s about it so far. I might drop in a short post later if anything new comes up. In the meantime I’ll enjoy the new shiny SCOM, if I may.

Br.
~Alex

3 thoughts on “SCOM – Upgrade to 1801 on production”

    1. Thank you! And yes, it seems to have been overwritten! I had trouble onboarding my production to OMS and once raised a case with MS Support. Your comment got me thinking: I dug up the summary of that ticket and compared it to the running values. I haven’t been modifying those, and they are set to their defaults. This is what we changed with the MS support guy; these settings seem to be running defaults on production though:
      Operation Name="GetEntityChangeDeltaList" TimeoutSeconds="900" / from current value
      OperationTimeout DefaultTimeoutSeconds="300" from current value
      Operation Name="EndSnapshot" TimeoutSeconds="3600" / from current value
      Setting Name="SnapshotSyncManagedEntityBatchSize" Value="5000" / from current value
      Setting Name="SnapshotSyncRelationshipBatchSize" Value="5000" / from current value
      WorkItem Name="SnapshotSynchronization" Enabled="true" Shared="true" FrequencySeconds="86400" TimeoutSeconds="3600" timeout from current value
      WorkItem Name="DeltaSynchronization" Enabled="true" Shared="true" FrequencySeconds="30" TimeoutSeconds="2000" timeout from current value

      I’ll add a note to the post about this too. Shame I didn’t think of it before.

      PS. I’ve been RSSing your blog for quite a while already 🙂


      1. Good to know. We have some of the same changes in our environment as well. They were first added in 2012, but we did a side-by-side migration to 2016, so I wasn’t able to check whether these changes were reset.
