Hi everyone,
I am attempting to improve the stability of our SDM environment by removing numerous duplicate organisations clogging our Organization table.
A bit of a background behind my questions:
Long ago, well before I became a part of my team, there was a concerted effort by the organisation to unify contacts throughout the organisation. To do so for our SDM environment, they wrote a script to create a data load that would run to create contacts, organisations and a number of other objects. The script appeared to work, however there was a bug they found, later that caused a couple dozen orgs to be replicated hundreds of thousands of times. This generated nearly 2 million organisations, most of which had duplicate names. For some reason, this change stuck and was not rolled back or otherwise corrected.
Present day, the SDM tool is seeing more use and activity than ever before. There is a lot of management, of things like CI's, which rely on attaching or managing Organisation links and Service Contracts. This is prompting queries on the enormous organisation list, upsetting the performance and stability of our environment. I set out to try and cull this list to help optimise and stabilise the tool.
The duplicate organisations were all children of one parent organisation with a generic name, as well as being linked to a single Service Contract. The only objects that appeared to be directly linked to a child organisation were the Contact's Facility attribute. The Contact's administrative orgs were set to the parent organisation.
The plan:
1.Set any contact that was linked to a duplicate or to instead link to the parent org. List generated by an extract listed about 6000 contacts. pdm_load the parent org uuid.
- pdm_load to inactivate all duplicate orgs (2mil) except 22 distinct organisation names.
- pdm_load remove all duplicate orgs from the usp_organization table, effectively removing the association with the Service Contracts.
Where the plan fell apart:
Technically, it all worked on our test environment. The UI is much more stable when reviewing the organisation list and Service Contract, and does not seem to show any ill effects.
However, the server logs went nuts. After any recent restart, these warnings would generate:
domsrvr:hw 2976 WARNING producer.c 4184 table_name() failed for attr_name=cnthandling_list
Generally 220k at a time, reoccurring every hour from the restart time, for a few days.
With some of these sprinkled in:
domsrvr:hw 4988 WARNING producer.c 4159 attr_by_key() failed for attr_name=tenant; producer=cnt
Within 24 hours the Object manager would crash and restart. It does not seem to experience additional restarts despite being hammered with further messages.
SDM logs eventually settle and produces messages only periodically throughout the day, in ranges of a few dozen to 16k.
These messages seemed to have started when step (1.) was completed. Steps (2.) and (3.) were completed as well, but did not seem to change the behaviour. I started by investigating the first message addressing the cnthandling_list attribute. Backtracking to the usp_special_handling table, I found the BREL cnthandling_list which linked to usp_contact_handling SREL special_handling. I found only one contact that had a special handling configured for it that was part of the contacts managed for this organisation cull. Reverting the contact, Organisation and usp_organization to its original state did not improve the situation.
I have checked the second message, but we are a single tenant setup. If it is looking at the ca_contacts for ‘tenant’, all of our contacts have tenant = null. Another dead end.
Another list of warnings started to appear a week later:
domsrvr 4372 WARNING val_attr.c 1127 Unable to invoke ATTR_INIT trigger for heat_overlay - value may be incorrect
domsrvr 4372 WARNING val_attr.c 1127 Unable to invoke ATTR_INIT trigger for heat_overlay - value may be incorrect
domsrvr 4372 WARNING top_ob.c 3714 Attempt to CI with no CO
domsrvr 4372 WARNING qrel_attr.c 1716 setup_attribute_hotlinks called more than once for QREL attached_sla.attached_slas
domsrvr 4372 WARNING qrel_attr.c 1716 setup_attribute_hotlinks called more than once for QREL tgt_time.target_times
domsrvr 4372 WARNING qrel_attr.c 1716 setup_attribute_hotlinks called more than once for QREL tgt_tgttpls_srvtypes.tgttpls
domsrvr 4372 WARNING producer.c 4184 table_name() failed for attr_name=cnthandling_list
domsrvr 4372 WARNING producer.c 4184 table_name() failed for attr_name=cnthandling_list
Lastly, now every time we change any attribute for any contact, we get a list of these messages:
domsrvr:hw 2976 WARNING val_attr.c 1127 Unable to invoke ATTR_INIT trigger for pgroups - value may be incorrect
domsrvr 4372 WARNING val_attr.c 1127 Unable to invoke ATTR_INIT trigger for pgroups - value may be incorrect
domsrvr:wsp 2448 WARNING val_attr.c 1127 Unable to invoke ATTR_INIT trigger for pgroups - value may be incorrect
domsrvr:02 1136 WARNING val_attr.c 1127 Unable to invoke ATTR_INIT trigger for pgroups - value may be incorrect
What do these messages indicate? The messages seem to only show a symptom of a problem, but do not detail anything about what is actually at fault. Everything in the UI seems to work, so there is no clear indication why the warnings are generated. What would you suggest I do to investigate further to resolve the current situation? Is there something I should have done differently to deploy this better?