Macrium Support Forum

Site Manager / Agent v8 unreliable connection

https://forum.macrium.com/Topic48588.aspx

By DAVe3283 - 13 June 2021 3:50 PM

For some background, I have been running Macrium Site Manager (MSM) v7 for a long time, and it has generally been very reliable. I have a setup with a couple of remote sites over a VPN, and some local computers at the main site. With MSM v7 & the v7 agents, they worked great 95% of the time, though once in a while one of the agents on the WAN would drop its connection to MSM (but continue to run the backup just fine).

Now that I have upgraded everything to v8, having a backup work correctly is now the fluke. Every computer on the WAN drops its connection during every backup, and now I am seeing computers on the same LAN drop too!! For example, the computer REAPER-3900X is on a 10Gbps link to the Macrium Site Manager server and the storage repository for the backup. Yet it "timed out launching Agent" according to MSM (only to report the backup was successful a few minutes later):


You can also see the spew of errors from computers on the WAN. I suspect this is the same root issue as this post: https://forum.macrium.com/48093/Site-Manager-v805934-Constantly-dropping-connection

This is causing 2 major problems:
1. When MSM loses connection with an agent, it starts another backup. I have a rule of 1 backup at a time to avoid slamming the VPN link to the remote sites, but now I will have 2 or even 3 backups running at a remote site at a time, saturating their internet connection.
2. The constant barrage of errors makes it hard to know when something actually goes wrong.
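
Problem 1 above can be illustrated with a small toy model. This is only a sketch of the symptom as described in this post, not MSM's actual code: the `Scheduler` class, its method names, and the agent names are all hypothetical, and whether MSM tracks its "1 backup at a time" slots this way internally is an assumption. It shows how a limit of one concurrent backup can be exceeded if a dropped connection frees the slot while the backup keeps running on the agent.

```python
from dataclasses import dataclass, field


@dataclass
class Scheduler:
    """Toy model of a one-backup-at-a-time rule (hypothetical, not MSM's code)."""
    max_concurrent: int = 1
    queue: list = field(default_factory=list)            # agents waiting for a backup
    tracked: set = field(default_factory=set)            # backups the scheduler believes are running
    actually_running: set = field(default_factory=set)   # backups really running on agents

    def start_next(self):
        # Start queued backups while the scheduler thinks it has a free slot.
        while self.queue and len(self.tracked) < self.max_concurrent:
            agent = self.queue.pop(0)
            self.tracked.add(agent)
            self.actually_running.add(agent)

    def on_disconnect(self, agent, release_slot):
        # release_slot=True models the symptom from the post: the lost
        # connection frees the slot even though the backup continues.
        if release_slot:
            self.tracked.discard(agent)
        self.start_next()

    def on_backup_finished(self, agent):
        # The slot is only truly free once the agent's backup has ended.
        self.tracked.discard(agent)
        self.actually_running.discard(agent)
        self.start_next()


def simulate(release_slot):
    """Return how many backups really run at once after two dropped connections."""
    s = Scheduler(queue=["pc-a", "pc-b", "pc-c"])
    s.start_next()
    s.on_disconnect("pc-a", release_slot)
    s.on_disconnect("pc-b", release_slot)
    return len(s.actually_running)
```

Calling `simulate(True)` ends with three backups running at the same time, while `simulate(False)` keeps the limit at one because the slot is held until the backup actually finishes.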

I am currently running MSM 8.0.5973, but this behavior has occurred on every version of v8 since launch.
By Alex - 16 June 2021 2:41 PM

Hi,

As you've correctly diagnosed, something is causing disconnections of the agent - there are some network-level things that can cause this, but if the same config worked with Site Manager 7, then it's likely that it's a bug somewhere since there have been no major changes to the Site Manager -> Agent comms protocol between 7 and 8. 

It's likely that the log files on the Agent computer have some useful information here - could you open a support ticket at https://macrium.com/support so one of our support team can transfer them? If you include a link to this forum post in the ticket, we can make sure it gets the appropriate attention. If you've already raised a support ticket, you can PM me the ticket number and I'll see if I can follow up on it.

One option that may work is to revert your Agent to running the version 7 agent temporarily. Our support team can assist with getting and installing the V7 agent.
By DAVe3283 - 16 June 2021 7:44 PM

Support ticket created.
By Pioneer - 29 July 2021 3:16 PM

Sorry to bring this up mate but did you ever get this sorted in the end? We're on the verge of resorting to a fresh install due to the same issue. Cheers.
By Alex - 29 July 2021 3:23 PM

Hi - is this something you have an open support ticket for? We are seeing a few of these issues, which are proving difficult to track down, and we'd like to get as many sources of data as possible to work it out.
By Pioneer - 29 July 2021 3:38 PM

Hi, thanks for getting back to us so quickly. Yes, we created a support ticket at the same time as posting in here. We're currently uploading support information with a case number from the Site Manager. Cheers.
By DAVe3283 - 29 July 2021 4:11 PM

I had a support ticket open and they were investigating it, but then they closed the ticket. Not sure if that was a side effect of them changing support software or giving up.

I would also like to get this fixed up. V7 was very reliable, but something they changed in the V8 network code is just not up to snuff.
I also noticed that if I restart the MSM server, instead of waiting for offline computers to connect, it fails the scheduled backup task, so when they do come online, they don't back up as no task is waiting. But after MSM has been on a few days, that part starts working right again.

I am also considering reverting to V7. A backup system that doesn't reliably back up computers kind of defeats the purpose...
By Alex - 29 July 2021 4:23 PM

Hi,

Sometimes our support system closes tickets if there hasn't been a response or there's been a delay in the case. Replying to the email should get things moving again. Alternatively, the closure may be due to the move between support systems. It might be worth sending a follow-up email to the ticket just in case.

We have an internal build which fixes the issue of offline computers not correctly deferring backups on system startup - our support team could make that available to you. The comms issue is more complicated as we have made no changes to the network code between version 7 and 8 - as you say, it was reliable so we didn't see a need to change it. What seems to be happening is that some new code is interacting with the network code in a different/incorrect way, causing the problem. We've done some work to analyze this and have created some internal test builds which may help. If you're interested, the support team can get you access to that to see if it helps.
By Pioneer - 30 July 2021 8:33 AM

Hi Alex,

Yes, we would be interested in giving one of the test builds a go and seeing if it manages to fix the disconnecting issue. Would we ask for this through our support case?

Currently we've been trying to get the support information from our Site Manager over to you. However, Site Manager is failing to send it, and trying to upload it to our support case is met with a "file limit exceeded" error. We've had to resort to using a WeTransfer link to send it over to you but haven't heard back. Is there a recommended way we can send the info over if not through the Site Manager? Cheers.
By Alex - 30 July 2021 8:45 AM

Hi,
I've chased this up with our support team - they've got the info needed, and I'm making sure they've got the test build information to forward on.