Idea Details

Spectrum Archive Manager enhancement

Last activity 05-31-2019 01:54 AM
raphael.franck
09-26-2014 10:19 AM

Hello,

 

starting with Spectrum version 9.4, it is officially supported to run an ArchMgr on the primary and secondary SpectroSERVERs.

As of now, it seems that the primary SpectroSERVER will still store events locally, once the primary ArchMgr is unavailable but the secondary ArchMgr is running.

Additionally, once the primary ArchMgr is up again, it will not re-sync events from the secondary ArchMgr but from the locally stored events only.

We'd like to see the ArchMgr enhanced to be a fully redundant/HA mechanism.

- if the primary or both ArchMgrs are running, the primary should catch and provide the events, and the secondary should just catch them

- if only the secondary ArchMgr is running, it should catch and provide the events; once the primary is back, it should re-sync all missing events

- the Report Manager as well as any other event client should always connect to the lower precedence value ArchMgr

- the "locally stored events" mechanism should stay as gateway of last resort in case both ArchMgrs fail
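The selection rule in the list above can be sketched roughly as follows. This is only an illustration of the requested behavior, not Spectrum code: the `ArchMgr` class, its fields, and the precedence value 20 for the secondary are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArchMgr:
    # Hypothetical stand-in for an Archive Manager instance.
    name: str
    precedence: int  # lower value = preferred
    running: bool


def pick_event_source(managers: List[ArchMgr]) -> Optional[ArchMgr]:
    """Return the running ArchMgr with the lowest precedence value.

    Returns None when no ArchMgr is running, which is the case where
    the locally stored events mechanism would be the last resort.
    """
    running = [m for m in managers if m.running]
    return min(running, key=lambda m: m.precedence, default=None)


primary = ArchMgr("primary", precedence=10, running=False)
secondary = ArchMgr("secondary", precedence=20, running=True)  # 20 is assumed
chosen = pick_event_source([primary, secondary])
print(chosen.name if chosen else "store events locally")
```

With the primary down, the sketch picks the secondary; with both down it returns `None`, and events would have to be stored locally.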


Comments

03-02-2017 02:29 PM

Lilah,

As the local store fills up, we delete 250 of the oldest events to make room for newer events.  These events are stored in the SpectroSERVER database, so yes they persist.
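A rough sketch of that eviction behavior, for readers who want to reason about it. It is an in-memory illustration only (the real store lives in the SpectroSERVER database); the class name `LocalEventStore` and the `max_events` parameter are made up for the example, and only the batch size of 250 comes from the description above.

```python
from collections import deque


class LocalEventStore:
    """Bounded store that drops the oldest events in batches when full."""

    def __init__(self, max_events, evict_batch=250):
        self.max_events = max_events    # hypothetical name for the threshold
        self.evict_batch = evict_batch  # 250 per the description above
        self._events = deque()

    def add(self, event):
        # When the store is full, remove a batch of the oldest events first.
        if len(self._events) >= self.max_events:
            for _ in range(min(self.evict_batch, len(self._events))):
                self._events.popleft()
        self._events.append(event)

    def oldest(self):
        return self._events[0] if self._events else None

    def __len__(self):
        return len(self._events)


store = LocalEventStore(max_events=1000)
for event_id in range(1001):
    store.add(event_id)
print(len(store), store.oldest())
```

In this sketch the 1001st event triggers one eviction of 250 events, so the store then holds 751 events and the oldest remaining one is event 250.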

 

-Todd

02-28-2017 04:53 AM

Thanks Todd.

 

So the latest events are stored locally (according to the threshold), and the older ones are destroyed to make room?

Are these events stored in RAM or can they persist in case of secondary server failure (with or without Archmgr)?

02-27-2017 09:30 AM

Lilah,

There has been no change to the locally stored events behavior, and there is currently no sync from the backup to the primary ArchMgr.  Events will be stored locally (meaning in the SS database) on both the primary and secondary SpectroSERVER when the primary ArchMgr is down.  We recommend setting the local events threshold to a value that would cover any primary ArchMgr downtime.
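One way to estimate such a value is simple arithmetic on event rate and expected downtime. The sketch below is an assumption-laden illustration: the function names, the 20% headroom, and the example rate of 5 events per second are invented for the example and are not Spectrum defaults.

```python
import math


def required_local_event_limit(events_per_second, max_downtime_hours,
                               headroom_percent=20):
    """Events to buffer for the longest expected primary ArchMgr outage."""
    total = events_per_second * max_downtime_hours * 3600
    return math.ceil(total * (100 + headroom_percent) / 100)


def coverage_hours(local_event_limit, events_per_second):
    """How long a given limit lasts before the oldest events get dropped."""
    return local_event_limit / (events_per_second * 3600)


# Assumed example: 5 events/s and a 4 hour maintenance window.
print(required_local_event_limit(5, 4))
# How long a limit of 50000 would last at the same assumed rate.
print(round(coverage_hours(50000, 5), 2))
```

At the assumed rate, a limit of 50000 would cover a little under three hours of primary ArchMgr downtime, so a longer maintenance window would need a larger value.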

 

-Todd

02-19-2017 04:23 AM

Hi,

 

Just wanted to check if the enhancement in 10.2 allows for buffering and then updating beyond the maximum local events threshold. For example, if 50000 is your configured local events threshold in the secondary server, it will still store the rest to the local Archive Manager and then when the primary server is accessible, this will be synchronized back to the primary. 

 

I'm asking because previously the maximum event threshold was the buffer size during disaster recovery, and perhaps now it's not critical.

12-18-2016 01:21 PM

Dear Spectrum Community Users,

 

This idea is delivered with the CA Spectrum 10.2 release.

 

Thanks,

Nagesh 

07-08-2016 09:08 AM

1. SRM will only connect to the primary ArchMgr.

 

2. Both primary and secondary SS have the locally stored events mechanism to store events when the primary ArchMgr is down.

 

-Todd

07-08-2016 09:07 AM

This work will be part of 10.2

07-08-2016 04:32 AM

Hi Nagesh, Todd,

 

could you please elaborate on what exactly will "be delivered by next release of Spectrum"?

The long thread above touches various aspects (some already delivered) and my last posting from April 1st still has some unanswered questions.

 

Thanks and best regards,

Raphael

04-07-2016 01:35 AM

Dear Spectrum Community Members,

 

This idea is planned to be delivered by next release of Spectrum.

 

Thanks,

Nagesh

04-01-2016 04:30 AM

Hi Todd,

 

good news!

Would you be so kind as to answer the remainder of my previous questions/concerns?

 

- the Report Manager as well as any other event client should always connect to the lower precedence value ArchMgr

=> you refer to "'100% reliable data source' used by SRM", is SRM hard coded to use primary/precedence 10 ArchMgr only?

 

- the "locally stored events" mechanism should stay as gateway of last resort in case both ArchMgrs fail

=> does the "locally stored events" mechanism exist on the primary only? If yes, then it would make sense to have it on the secondary as well

 

best regards,
Raphael

03-17-2016 10:17 AM

Good News!

Thank you, Todd!

03-17-2016 09:01 AM

Frank,

I've made changes to handle the scenario you mention.  The sync process will still be from primary ArchMgr to secondary ArchMgr, but we will wait for locally stored events to be flushed before starting the sync.  I also added protection for a few other scenarios, so anytime there is a chance of event loss we will always trigger a resync to make sure no events are lost.

 

This should really bring us to full redundancy/HA.  I'm continuing to test, and will probably hand off to QA next week. 

 

-Todd

03-15-2016 06:57 AM

Hi Todd,

as far as I see, this may be a good compromise.

But...

...there is one question left:

The scenario is:

At least one SS is running, but both the primary and the secondary ArchMgr are down.

The secondary ArchMgr is back before the primary is back.

When the primary becomes active, it will receive the locally stored events.

The question is:

Is there a chance to send these events also to the secondary ArchMgr?

Regards, Frank

03-14-2016 04:50 PM

Folks,

I am wrapping up the implementation details for this FT Archive Manager event sync, for inclusion in Spectrum 10.2.  If you would be interested in helping validate this functionality before the 10.2 time frame, please reply to this thread.

 

Here is the summary of the new functionality to fill in the missing `gaps` in my FT ArchMgr implementation:

 

1. Going forward, the secondary ArchMgr will keep track of the last primary event received.

2. When the secondary ArchMgr is restarted, we will read this last event time from the ddm database, and retrieve all events from the primary Archive Manager from this time forward.

     a. This is a background task, with low priority, so will not impede functioning of either ArchMgr.

3. A complete full-sync will not generally be triggered, as this could be millions of events to sync.

     a. A full sync could be manually triggered, but unsure if this is warranted.
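The incremental sync summarized above could be sketched like this. It is a simplified model rather than the actual implementation: the class names, the `events_since` method, and duplicate detection by equality are assumptions, and the real last-event time is persisted in the DDM database rather than held in memory.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Event:
    timestamp: float
    message: str


@dataclass
class PrimaryArchMgr:
    events: List[Event] = field(default_factory=list)

    def events_since(self, timestamp):
        # "From this time forward", so the boundary event is included.
        return [e for e in self.events if e.timestamp >= timestamp]


@dataclass
class SecondaryArchMgr:
    events: List[Event] = field(default_factory=list)
    last_primary_event_time: float = 0.0  # stand-in for the persisted value

    def receive(self, event):
        if event not in self.events:
            self.events.append(event)
        self.last_primary_event_time = max(self.last_primary_event_time,
                                           event.timestamp)

    def sync_from(self, primary):
        """Fetch only events newer than the last one seen; return the count."""
        candidates = primary.events_since(self.last_primary_event_time)
        missing = [e for e in candidates if e not in self.events]
        for event in missing:
            self.receive(event)
        return len(missing)


primary = PrimaryArchMgr([Event(float(t), f"event {t}") for t in range(1, 5)])
secondary = SecondaryArchMgr()
secondary.receive(primary.events[0])
secondary.receive(primary.events[1])  # secondary goes down after this event
print(secondary.sync_from(primary))   # events 3 and 4 are fetched on restart
```

The point of the sketch is that only the window since the last known event is transferred, which is why a full sync of possibly millions of events is not needed in the normal case.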

 

Please let me know if you have any questions or concerns.  Please note that this will require open communication between primary and secondary Archive Managers, something that was never required before.

 

Here is a sample of the planned output that will indicate sync status:

 

Mar 14 16:29:14 : ArchMgr successfully synced 23464 events( Feb 27 18:26:15 - Mar 14 16:29:07 ) from the primary ArchMgr in 35.557 seconds.
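For anyone who wants to monitor this message, a small parser could look like the sketch below. It assumes the line keeps exactly the format of the sample above, which is only the planned output and may differ in the released version; the function name `parse_sync_line` is made up for the example.

```python
import re

# Pattern written against the sample line above; the final format may differ.
SYNC_LINE = re.compile(
    r"ArchMgr successfully synced (?P<count>\d+) events\(\s*"
    r"(?P<start>.+?) - (?P<end>.+?)\s*\) "
    r"from the primary ArchMgr in (?P<seconds>[\d.]+) seconds\."
)


def parse_sync_line(line):
    """Return sync details as a dict, or None if the line does not match."""
    match = SYNC_LINE.search(line)
    if match is None:
        return None
    count = int(match.group("count"))
    seconds = float(match.group("seconds"))
    return {
        "count": count,
        "start": match.group("start"),
        "end": match.group("end"),
        "seconds": seconds,
        "events_per_second": count / seconds if seconds else None,
    }


sample = ("Mar 14 16:29:14 : ArchMgr successfully synced 23464 events"
          "( Feb 27 18:26:15 - Mar 14 16:29:07 ) "
          "from the primary ArchMgr in 35.557 seconds.")
print(parse_sync_line(sample))
```

For the sample line this works out to roughly 660 events per second.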

 

Thanks,

Todd

11-09-2015 03:08 AM

Hi Todd,

 

thanks for providing detailed insight, which is good to have one year after the thread started.

I'd like to get back to my initial statements, please comment if I got things right or wrong.

 

- if the primary or both ArchMgr run, then the primary should catch and provide and the secondary just catch the events

=> existing

 

- if the secondary ArchMgr runs only, then it should catch and provide the events, once the primary gets back again, it should re-sync all missing events

=> partly existing, re-sync for secondary is missing

 

- the Report Manager as well as any other event client should always connect to the lower precedence value ArchMgr

=> existing, you refer to "'100% reliable data source' used by SRM", is SRM hard coded to use primary/precedence 10 ArchMgr only?

 

- the "locally stored events" mechanism should stay as gateway of last resort in case both ArchMgrs fail

=> "locally stored events" probably existing on primary only, would make sense to have that on secondary as well

 

your statement:

However, generally this shouldn't be a problem, ... and generally the ArchMgr is up anyhow.

=> The ArchMgr shall be up all the time, but in reality it isn't.

 

summary:

For now it's mostly about data gaps in the secondary ArchMgr caused by not re-syncing with a complete data source. These gaps will definitely be introduced on a regular basis due to operating system maintenance and, more importantly, CA-recommended DDMdb maintenance. So based on the user's requirement to have the "Events" available for investigation all the time, this is missing functionality.

 

 

best regards,
Raphael

10-26-2015 07:52 AM

Frank, maybe we need to talk about this... as either this functionality is not working properly or misunderstood.  All current events should be available and visible if the secondary ArchMgr is running.

 

Please drop me an email and we can schedule a quick chat - todd.kornely@ca.com.

 

Thanks,

Todd 

10-26-2015 04:13 AM

Todd,

thank you for the implementation of the FT ArchMgr functionality. It's a great step forward!

Our focus is not on using the ArchMgr with SRM.

We need the ArchMgr for analyzing events in OneClick, and there are scenarios where we can't wait until the primary is available again.

Here is an example:

Maintenance actions on both servers, e.g. OS patches. Normally, we do this first on the secondary machine and then on the primary. Some of these actions can take hours, and if the secondary machine is then the only one running, there is a gap in the visible events. Of course, this will be closed later, but as I said, sometimes the information is needed immediately.

Other scenarios where only the secondary machine is online are:

- Simulated or real disasters in the datacenter where the primary server resides.

- Reconstruction work in this datacenter on a weekend

If there were any earlier planned or unplanned downtimes on the secondary side, then the operators will see unwanted gaps during the downtime of the primary site.

 

Or, in other words:

We want to run Spectrum with all components as a 7x24h service, and therefore we don't want any gaps in the event history, whether temporary or permanent.

 

Regards, Frank

10-23-2015 08:01 AM

Folks,

I implemented this FT ArchMgr functionality, and I wanted to clear up some points, as I do believe that this mechanism works well, and provides necessary HA functionality:

 

1. Whenever the secondary ArchMgr is running, it is receiving and storing current events, regardless of whether the primary is up or whether the SS is locally storing events (local storage is just for the primary ArchMgr).

    a. So in the case where the primary ArchMgr is down, no events are lost, all events are stored in the secondary DDMDb and you have full access to these events during the primary ArchMgr downtime.

    b. When the primary ArchMgr comes up, the locally stored events are transferred to only the primary ArchMgr( as the secondary already has these events )

   c. This locally stored event mechanism works well and I don't see any reason to re-invent the wheel.  Are there problems with this?

 

2. The only deficiency I see in this design is that there is no sync for when the secondary ArchMgr is down, so whenever the secondary ArchMgr is down, events will be missing on the secondary ArchMgr.  However, generally this shouldn't be a problem, as the primary ArchMgr is the '100% reliable data source' (used by SRM), and generally the ArchMgr is up anyhow.
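The two points above can be modeled with a small toy simulation. It is a sketch of the described behavior only, not of the real code; the class `EventFlow` and its attribute names are invented for the illustration.

```python
class EventFlow:
    """Toy model of one SpectroSERVER feeding two Archive Managers."""

    def __init__(self):
        self.primary_up = True
        self.secondary_up = True
        self.primary_db = []
        self.secondary_db = []
        self.local_store = []  # SS-side buffer, used for the primary only

    def emit(self, event):
        # The secondary stores current events whenever it is running.
        if self.secondary_up:
            self.secondary_db.append(event)
        # Events for a down primary are buffered locally on the SS.
        if self.primary_up:
            self.primary_db.append(event)
        else:
            self.local_store.append(event)

    def restore_primary(self):
        # Locally stored events go to the primary only.
        self.primary_up = True
        self.primary_db.extend(self.local_store)
        self.local_store.clear()


flow = EventFlow()
flow.emit("e1")
flow.primary_up = False
flow.emit("e2")            # buffered for the primary, stored on the secondary
flow.restore_primary()
flow.secondary_up = False
flow.emit("e3")            # missed by the secondary, never re-synced
flow.secondary_up = True
flow.emit("e4")
print(flow.primary_db)     # ['e1', 'e2', 'e3', 'e4']
print(flow.secondary_db)   # ['e1', 'e2', 'e4']
```

The primary ends up complete after the flush, while the secondary keeps a gap for the period it was down, which is the deficiency described in point 2.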

 

I am certainly willing to make improvement here, I just wanted to be clear what the current functionality is.

 

Todd Kornely

Senior Principal Software Engineer

10-23-2015 06:52 AM

Hello Nagesh_Jaiswal

Here is one use case:

If only the secondary machine is available, it's important for the users that they have an Archive Manager without gaps.

If the primary is unavailable because of network problems, the operators may need all events to analyze the problems.

Regards, Frank

 

 


10-23-2015 06:35 AM

Hello Spectrum Community Users,

 

Sorry for the delayed response on this. We did review this, and the initial assessment was that this idea involves a lot of effort and risk. We are, however, re-investigating it.

 

A few questions came up during our internal reviews, though. We don't entirely understand the reasoning/desire for this. What problems or business use cases will this idea solve?

 

Thanks,

Nagesh

10-19-2015 01:56 PM

Over 100 votes!

Isn't it time for a new statement, CA?

10-19-2015 10:48 AM

Definite value, and I would be interested!

10-15-2015 07:42 AM

I would also appreciate this idea and wonder why it has not been implemented so far.

09-01-2015 11:36 AM

Any updates on this?

08-24-2015 03:00 AM

Hello Nagesh, CA,

 

something like 3 months ago you promised to get back on this "soon". The topic still seems to be of interest- we got more than 1000 views in total.

What is the current state of investigation/planning/development?

 

regards,

Raphael

08-04-2015 04:20 AM

Hello Nagesh,

what does "soon" mean?

In the last months we had a few situations where the secondary ArchiveManager was helpful. But my users also noticed the gaps.

I could only say that the gaps will be filled at some point, but I don't know when.

Until now, there are 87 votes for this idea, and I'm sure I'm not the only one who is really waiting for this function.

 

Regards, Frank

06-08-2015 02:13 PM

Definitely a good idea!

06-08-2015 02:44 AM

Hello Nagesh, CA,

 

is there any news on reviewing this idea? Up to now, more than 800 views were counted for this one, so there seems to be quite some interest here.

 

regards,

Raphael

03-28-2015 05:17 AM

Thanks for posting the idea for CA Spectrum. We are currently reviewing this idea. We will get back on this soon.

 

Thanks,

Nagesh

02-18-2015 10:49 AM

Hi,

I hope to test this new feature (HA of events).

12-12-2014 03:46 AM

This will be very helpful.

12-03-2014 03:02 AM

This is exactly what makes the new functionality complete!