I am attempting to improve the stability of our SDM environment by removing numerous duplicate organisations clogging our Organization table.
A bit of background behind my questions:
Long ago, well before I joined my team, there was a concerted effort to unify contacts throughout the organisation. For our SDM environment, they wrote a script to generate a data load that would create contacts, organisations and a number of other objects. The script appeared to work; however, a bug was later found that caused a couple dozen orgs to be replicated hundreds of thousands of times. This generated nearly 2 million organisations, most with duplicate names. For some reason, this change stuck and was never rolled back or otherwise corrected.
Present day, the SDM tool is seeing more use and activity than ever before. There is a lot of management of things like CIs, which relies on attaching or managing Organisation links and Service Contracts. This prompts queries against the enormous organisation list, hurting the performance and stability of our environment. I set out to cull this list to help optimise and stabilise the tool.
The duplicate organisations were all children of one parent organisation with a generic name, and were all linked to a single Service Contract. The only direct link to a child organisation appeared to be through a Contact's Facility attribute; the Contacts' administrative orgs were set to the parent organisation.
1. Set any contact that was linked to a duplicate org to instead link to the parent org. A list generated by an extract contained about 6,000 contacts; pdm_load was used to apply the parent org UUID.
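The relink in step 1 can be scripted by rewriting the extracted contact rows so every organization_uuid points at the parent org before feeding the file back to pdm_load. The sketch below is a minimal, hedged example: the brace-delimited row layout is an assumption based on typical pdm_extract output (TABLE header, attribute list, then one `{ ... }` row per record), and the parent UUID is a placeholder — verify both against a real dump before using anything like this.

```python
# Sketch: rewrite a pdm_extract dump so every contact's organization_uuid
# points at the parent org, ready for pdm_load. The row format here is an
# assumption based on typical pdm_extract output; check it against your dump.
import re

PARENT_ORG_UUID = "AABBCCDDEEFF00112233445566778899"  # placeholder value

def relink_contacts(extract_text: str, parent_uuid: str) -> str:
    out_lines = []
    for line in extract_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("{") and stripped.endswith("}"):
            # Row format assumed: { "<id>" , "<organization_uuid>" }
            fields = re.findall(r'"([^"]*)"', stripped)
            contact_id = fields[0]
            out_lines.append('\t{ "%s" , "%s" }' % (contact_id, parent_uuid))
        else:
            out_lines.append(line)  # keep TABLE header and attribute list as-is
    return "\n".join(out_lines)
```

The rewritten file would then be applied with `pdm_load -f <file>`, ideally in a test environment first, as the rest of this thread makes painfully clear.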
Where the plan fell apart:
Technically, it all worked on our test environment. The UI is much more stable when reviewing the organisation list and Service Contract, and does not seem to show any ill effects.
However, the server logs went nuts. After any restart, these warnings would be generated:
domsrvr:hw 2976 WARNING producer.c 4184 table_name() failed for attr_name=cnthandling_list
Generally 220k at a time, recurring every hour from the restart time, for a few days.
With some of these sprinkled in:
domsrvr:hw 4988 WARNING producer.c 4159 attr_by_key() failed for attr_name=tenant; producer=cnt
Within 24 hours the Object Manager would crash and restart. It does not seem to experience additional restarts despite being hammered with further messages. The SDM logs eventually settle and produce messages only periodically throughout the day, in batches ranging from a few dozen to 16k.
These messages seem to have started when step (1.) was completed. Steps (2.) and (3.) were completed as well, but did not seem to change the behaviour. I started by investigating the first message, addressing the cnthandling_list attribute. Backtracking to the usp_special_handling table, I found the BREL cnthandling_list, which links to the usp_contact_handling SREL special_handling. I found only one contact with a special handling configured that was part of the contacts managed in this organisation cull. Reverting that contact, Organisation and usp_organization row to their original state did not improve the situation.
I have checked the second message, but we are a single-tenant setup. If it is looking at ca_contact for 'tenant', all of our contacts have tenant = null. Another dead end.
Another list of warnings started to appear a week later:
domsrvr 4372 WARNING val_attr.c 1127 Unable to invoke ATTR_INIT trigger for heat_overlay - value may be incorrect
domsrvr 4372 WARNING top_ob.c 3714 Attempt to CI with no CO
domsrvr 4372 WARNING qrel_attr.c 1716 setup_attribute_hotlinks called more than once for QREL attached_sla.attached_slas
domsrvr 4372 WARNING qrel_attr.c 1716 setup_attribute_hotlinks called more than once for QREL tgt_time.target_times
domsrvr 4372 WARNING qrel_attr.c 1716 setup_attribute_hotlinks called more than once for QREL tgt_tgttpls_srvtypes.tgttpls
domsrvr 4372 WARNING producer.c 4184 table_name() failed for attr_name=cnthandling_list
Lastly, now every time we change any attribute for any contact, we get a list of these messages:
domsrvr:hw 2976 WARNING val_attr.c 1127 Unable to invoke ATTR_INIT trigger for pgroups - value may be incorrect
domsrvr 4372 WARNING val_attr.c 1127 Unable to invoke ATTR_INIT trigger for pgroups - value may be incorrect
domsrvr:wsp 2448 WARNING val_attr.c 1127 Unable to invoke ATTR_INIT trigger for pgroups - value may be incorrect
domsrvr:02 1136 WARNING val_attr.c 1127 Unable to invoke ATTR_INIT trigger for pgroups - value may be incorrect
What do these messages indicate? They seem to show only a symptom of a problem, without detailing what is actually at fault. Everything in the UI seems to work, so there is no clear indication of why the warnings are generated. What would you suggest I do to investigate further and resolve the current situation? Is there something I should have done differently to deploy this better?
You may want to consider opening a support ticket for help understanding and troubleshooting this scenario. Basically, I see a lot of mess around the org table, and it needs further investigation to see how far the relationships have been tangled.
I currently have this being looked at. The feedback so far: they have advised that it might be related to the partition or permission groups the contacts are part of. On their advice I have tried detaching the Service Contracts and Organisations completely, but the behaviour is much the same.
Is it possible to find a unique key to identify the "junk" in your table? If so, I would write an SQL query to disable and rename the duplicates so that you can completely set them aside. I would make a "mass-update" directly in SQL ... I'm not sure that pdm_extract / pdm_load will be able to handle that much data ... SQL rocks.
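A hedged sketch of that disable-and-rename pass follows, using Python's sqlite3 as a stand-in for the MS SQL back end so it can run self-contained. The table and column names (ca_organization, name, inactive) follow the thread, but the `zz_dup_` prefix, the toy rows, and the keep-the-lowest-id-per-name rule are all illustrative assumptions, not the actual schema or data.

```python
# Sketch: disable and rename duplicate org rows so they can be set aside.
# sqlite3 stands in for MS SQL; schema and sample data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Toy stand-in for ca_organization: one good org plus duplicates of it.
cur.execute(
    "CREATE TABLE ca_organization "
    "(id INTEGER PRIMARY KEY, name TEXT, inactive INTEGER DEFAULT 0)"
)
rows = [(1, "ACME Corp"), (2, "ACME Corp"), (3, "ACME Corp"), (4, "Parent Org")]
cur.executemany("INSERT INTO ca_organization (id, name) VALUES (?, ?)", rows)

# Disable and rename every duplicate, keeping the lowest id per name so
# exactly one row with the original name survives.
cur.execute("""
    UPDATE ca_organization
       SET inactive = 1,
           name = 'zz_dup_' || id || '_' || name
     WHERE id NOT IN (SELECT MIN(id) FROM ca_organization GROUP BY name)
""")
conn.commit()
```

Renaming as well as deactivating means the duplicates no longer collide with the real names in lookups, which is the "completely set them aside" part of the suggestion.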
The pdm_load worked as expected, even if it took a very, very long time. It only included two fields in the ca_contact load: id and organization_uuid. Reviewing the database shows everything loaded correctly.
We could only identify them by name, and even then, their names fall into 22 variations, each with tens of thousands of duplicates. We set up our extract list using some keywords that made them unique. The naming scheme, together with the inactive flag now being '1', makes them fairly unique. However, in our Prod environment I need a way to detach the Contacts from these Orgs while preventing these issues in the process. I suspect performing this from the SQL DB would yield similar results.
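Finding the distinct name variations and how many duplicates each one carries is a simple grouping pass over the org names. A small sketch, with an invented name list standing in for a dump of the real table:

```python
# Sketch: count duplicate org-name variations. The sample names are invented;
# in practice the list would come from a dump of ca_organization names.
from collections import Counter

org_names = ["ACME East", "ACME East",
             "ACME West", "ACME West", "ACME West",
             "Parent Org"]

counts = Counter(org_names)
# Variations that actually have duplicates are the cull candidates.
duplicated = {name: n for name, n in counts.items() if n > 1}
```

In the real data this pass would surface the 22 variations described above, each with its duplicate count, which is exactly the batch list the later SQL deletion steps work through.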
Further investigation from our regional support suggests that there may be other things going on when the ca_contact table is modified through the UI. Some local references may get updated through the UI that are not updated when the table is modified directly. Updates to the ca_organization table, while a massive load, have not produced any indication of a problem when making the rows inactive via pdm_load. Still, I will investigate updating ca_organization through SQL directly in the future.
So, when a contact is updated with various Orgs via the UI, what else receives this information? Can we update this other information resource in any other way? Web services perhaps?
Everything in the behaviour and warning messages suggests the issue revolves around the contact relationships to something. After the environment is restarted, it spits out these errors every hour for up to 3 days. Then, for up to 5 days, only every 12 hours. After a week, it seems to be completely quiet. There seems to be some other reference that is built dynamically, perhaps some sort of caching or local string?
I read your question and have felt your pain, so I wanted to share my experience for whatever benefit it may be to you. My story: the department table was linked to a zLocation table by UUIDs, which effectively made adding or deleting departments impossible through SOAP or the UI. A straight SQL load was the only available option. I ended up working with support to drop and recreate the department table (minus the old customization) and then used PAM to repopulate it with correct data (that same process currently maintains the data).
Is there a better way of performing this task?
IMO, this is free advice, and the old rule still stands: free advice is worth what you pay for it.
I would purge the 2 million rows of junk data from the org table, because it's not gone till it's gone. That is not a trivial amount of data for a SQL engine, let alone the object layer of SD, even if it is marked inactive. If you can get away with using SQL to drop the rows, lucky you! Another option may be the built-in archive and purge utility; I haven't used it heavily, but I have purged a log table with 9 million rows without issue, and I currently have about 3 scheduled purges on various tables. ('Purge only' is your friend, as archive writes massive text files, which in this case would be worthless.) If all else fails, you can recreate and repopulate the table(s) affected by the UUIDs of the old Orgs and service contract (IT PAM is capable of maintaining large numbers of rows of data, it just takes some time), but know this option is a last resort. One last note: junk duplicate names may also create further complications in the future when trying to use web services to maintain existing data or create new data.
SQL vs PDM load vs Web Services
As I understand it, the hierarchy of preference runs from least preferred, SQL, through pdm_load (middle of the road, but archaic), to Web Services as the currently recommended method. Incidentally, they also become less powerful but more foolproof as they go. pdm_load is powerful, but the potential to load data incorrectly, in such a way that it can't be modified or accessed by the object layer in the future, is a concern to be aware of (search the communities for 'pdm' or 'federated'; truth be told, the mess you're dealing with was probably created with pdm_load). One note on SOAP should you decide to use it: keep the return payloads as small as possible for performance's sake; below is the area in a PAM operator to use.
Get rid of the junk data and reload only what you need with web services. (The web services should take care of the dependencies you mentioned in your last post, or at least alert you so you can handle them.)
The code you write will most likely be much better quality, easier to maintain, and perform better than the code that created the mess!
All the best!
I have been able to resolve this issue with a combination of PDM loads and direct SQL data management.
Coordinating with our support vendor, we found that the pdm_load I had performed actually crashed part way through loading the 1.8 million rows. The loader kept processing corrupt data, however, and modified information elsewhere, though it was not clear what or where. This was repeated in a virtual environment, and it seems clear pdm_load simply cannot handle that much data, probably due to caching limitations. Incidentally, in preparation for reloading our UAT environment with PROD data, I rolled back some patches and performed a pdm_configure. The error spam that had plagued this environment since the corrupt load did not return after re-configuring, even before loading the prod database. Whatever pdm_configure does, it touched the area where the corrupted entries must have existed.
I have done extensive research and experimentation on the ca_organization and usp_organization tables, and have found that surprisingly few links actually exist for them. This is cut down further by the way these rows were added; they were not connected to much. Because of our hierarchy, we have a primary Organization and child orgs. Most of the links were fed through the primary parent, and the children were only linked through the usp_organization table and contacts. Once the contacts were reassigned to the parent, and the usp_organization rows were removed, the orgs in the ca_organization table were completely isolated; they affected nothing.
I developed a protocol for removing the ca_organization rows that I have found to work without any errors or unusual behaviour:
1. Gather the list of unique Organizations and their UUIDs (22 were found in my case).
2. pdm_extract all contacts with a duplicate org name as their "facility".
3. pdm_load the list of contacts, changing their existing organization to the parent Organization UUID.
4. In MS SQL, delete all duplicate org rows from the usp_organization table, with the exception of the 22 unique orgs.
5. In MS SQL, set inactive for all duplicate orgs, with the exception of the 22 unique orgs.
6. In MS SQL, delete the ca_organization rows for each duplicate batch (22 batches), conditioned on org name and inactive = 1.
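The SQL half of the protocol (steps 5 and 6) can be sketched against a toy schema. sqlite3 stands in for MS SQL here so the example runs self-contained; the column names, the UUID values, and the two-name keeper set are invented stand-ins for the real 22 unique orgs.

```python
# Sketch of steps 5-6: deactivate everything except the keeper orgs, then
# delete duplicates one name batch at a time. sqlite3 stands in for MS SQL;
# schema, UUIDs, and sample rows are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE ca_organization "
    "(uuid TEXT PRIMARY KEY, name TEXT, inactive INTEGER DEFAULT 0)"
)
cur.executemany("INSERT INTO ca_organization VALUES (?, ?, 0)",
                [("u1", "ACME East"), ("u2", "ACME East"), ("u3", "ACME East"),
                 ("u4", "ACME West"), ("u5", "ACME West")])

keepers = {"u1", "u4"}  # stands in for the 22 unique orgs in the real run

# Step 5: mark everything except the keepers inactive.
cur.execute("UPDATE ca_organization SET inactive = 1 WHERE uuid NOT IN (%s)"
            % ",".join("?" * len(keepers)), tuple(keepers))

# Step 6: delete one name batch at a time, conditioned on name AND inactive = 1
# so a keeper row can never be swept up by accident.
for batch_name in ("ACME East", "ACME West"):
    cur.execute("DELETE FROM ca_organization WHERE name = ? AND inactive = 1",
                (batch_name,))
conn.commit()
```

The double condition (name plus inactive flag) is what makes the batched delete safe: even if a batch name matches a keeper, the keeper's inactive = 0 excludes it from the DELETE.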
After each step I verified that the environment was stable, produced no errors and behaved normally. Once the last step was completed, only a little over 3,000 orgs remained in the table. I have been working with the environment for a month, and no unusual behaviour relating to these tables has appeared.
Thank you all for your input and advice. The base issue has been plaguing the environment for years, but no one wanted to open a can of worms trying to fix it. Turns out the solution is surprisingly simple, and just required some research, experimentation and caution.
Well done! A very nice example of patience, persistence, insightful analysis and craftsmanship in the resolution. You must feel very satisfied to have resolved what must initially have looked like an intractable mess.
Great work Sean!