I am trying to finish up my automated failover process and wondered if anyone here might help me with a couple of tough spots.
I would like to have an environment where phones and calls automatically fail over to the backup server in the event of a problem with the production server. I have identically configured Thirdlane MTE servers in two cities. I back up my production system nightly and ship the backup to the backup server.
The high-level steps to make the failover work are, basically, as follows:
1) Restore the backup file on the backup server.
2) Back up and restore voicemail messages on the backup server.
3) Switch the phones from one server to the other.
4) Switch the inbound trunks from one server to the other.
I could automate 1) just by untarring the backup in the root of the backup server, but I noticed when I did that last night I invalidated my Thirdlane license. I had to reinstall the license file before I could use the server. Is the license stored somewhere on the server in a place where I can exclude it from my backup, or could I run a script to automatically put the correct license file back each time I restore a backup?
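A minimal sketch of the "exclude it on restore" idea, assuming the backup is an ordinary tar archive. The thread does not say where the license lives, so the path in the usage comment is purely hypothetical, as are the function name `restore_backup` and the archive path.

```python
import tarfile


def _normalize(name):
    """Turn './etc/foo' or '/etc/foo' into 'etc/foo' for comparison."""
    if name.startswith("./"):
        name = name[2:]
    return name.strip("/")


def restore_backup(archive, dest, keep_paths):
    """Extract `archive` into `dest`, skipping everything under `keep_paths`.

    `keep_paths` are archive-relative files or directories that belong to
    the backup server itself and must not be overwritten by the restore.
    Returns the names of the members that were extracted.
    """
    keep = [_normalize(p) for p in keep_paths]

    def wanted(member):
        name = _normalize(member.name)
        return not any(name == k or name.startswith(k + "/") for k in keep)

    with tarfile.open(archive) as tar:
        members = [m for m in tar.getmembers() if wanted(m)]
        # Newer Python versions may warn about, or apply, an extraction
        # filter here; check what your version does with ownership and modes.
        tar.extractall(path=dest, members=members)
    return [m.name for m in members]


# Hypothetical usage; both paths below are placeholders, not real locations:
# restore_backup("/backups/production.tar.gz", "/", ["etc/example-license-file"])
```

The alternative raised in the question, copying a saved license file back after every restore, would work the same way once the real location is known.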
Step 3) is easy to accomplish with a little thing called an Edgemarc 4500 router from Edgewater Networks. I enable Survivability Mode on the 4500 and enter both servers. If the Edgemarc detects that production is down, it quickly switches the phones to register with the backup server.
I'm stumped with the other two. How difficult would it be to store the voicemail on a different server so that it is available from either production or backup? Worst case, I guess I could set voicemail to always email messages out and not worry about customers not getting messages that are left on the server that is having problems. But it would be nice to separate the voicemail function from the rest of the server so that it could be available from anywhere.
How about moving the incoming trunks? My SIP trunking provider lets me create trunks and assign DIDs to them at will. I normally create a separate trunk for each tenant. My sip.conf file has a registration statement for each customer. For now, when I restore a backup I comment out the registration statements before starting Asterisk so that both servers aren't trying to register with the SIP trunk provider. In a failover scenario I either assume production is dead or comment out the registration statements on it, and then uncomment the registration statements on backup. Is there an easier way? Does any of this make sense?
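The comment/uncomment step described above could be scripted. The following is only a sketch: it assumes the registrations are plain `register =>` lines in sip.conf and that `;` is used to comment them out. The function name `set_registrations` and the sample credentials in the usage comment are made up for illustration.

```python
import re

# Matches a register line, whether or not it is already commented out.
REGISTER_RE = re.compile(r"^(\s*);*\s*(register\s*=>.*)$", re.IGNORECASE)


def set_registrations(conf_text, enabled):
    """Return sip.conf text with every register line enabled or commented out.

    Caveat: a documentation comment that happens to start with
    '; register =>' would also be uncommented when enabled=True.
    """
    out = []
    for line in conf_text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        m = REGISTER_RE.match(body)
        if m:
            prefix = "" if enabled else ";"
            body = f"{m.group(1)}{prefix}{m.group(2)}"
        out.append(body + ending)
    return "".join(out)


# Hypothetical usage on the backup server during a failover:
# text = open("/etc/asterisk/sip.conf").read()
# open("/etc/asterisk/sip.conf", "w").write(set_registrations(text, True))
# ...followed by a reload, e.g.: asterisk -rx "sip reload"
```

Running it with `enabled=False` on the production box, if it is still reachable, covers the other half of the manual procedure.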
You would definitely want the
You would definitely want the trunks to switch over when the Edgemarcs switch over. I think you can 'assume' that all the Edgemarcs will fail over at the same time. I would figure out exactly what the circumstances are for them to fail over, i.e. 30 seconds of not responding to OPTIONS messages, whether it is based on ping response, etc.
Once you have that determined you should be able to replicate the process on the backup box using sipsak or similar. Once the failover process is triggered you could run a script that will rewrite the trunks config file and reload Asterisk to get calls going to the backup box.
The last step would be to make sure the primary box doesn't continue to register. How would it know to stop trying to register for the trunks? What if it was only a 10-minute ISP outage, etc.? Switching back would be a challenge; I would guess the Edgemarcs would switch back immediately when the primary box came back alive.
I think planning for a complete outage of the primary box is a lot easier than planning for a partial software failure. If Asterisk is kind of working, it gets a lot harder for all the routers to fail over at the same time, etc.
I wouldn't worry about voicemails. When you untar the backup, they should go to the correct folder. Any left that day would be lost in the case of a disk failure.
My 2 cents.
-M
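As a rough illustration of the "sipsak or similar" idea, here is a sketch of a SIP OPTIONS probe the backup box could run against the primary. It uses only the Python standard library rather than sipsak itself. The function names, the `probe` user in the From header, and the choice to treat any SIP reply (including error codes) as "alive" are all assumptions.

```python
import socket
import uuid


def build_options(host, port=5060, local_ip="127.0.0.1", local_port=5060):
    """Build a minimal SIP OPTIONS request as bytes."""
    lines = [
        f"OPTIONS sip:{host}:{port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:{local_port}"
        f";branch=z9hG4bK{uuid.uuid4().hex};rport",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_ip}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{host}:{port}>",
        f"Call-ID: {uuid.uuid4().hex}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe(host, port=5060, timeout=2.0):
    """Send one OPTIONS over UDP; True if any SIP response comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            local_ip, local_port = sock.getsockname()
            sock.send(build_options(host, port, local_ip, local_port))
            reply = sock.recv(4096)
        except OSError:  # timeout, refused, unreachable, DNS failure
            return False
    return reply.startswith(b"SIP/2.0 ")


# Hypothetical usage: probe("primary-pbx.example", 5060)
```

A single missed reply should not trigger anything on its own; the result would feed whatever miss counter matches the routers' failover criteria.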
Hmmm... Interesting, Matt.
Hmmm... Interesting, Matt. I hadn't thought about scripting a rewrite of the trunk files. That may work.
Here are the survivability settings on the Edgemarc:
Time (s) between DNS lookups: 60
Time (s) between Keepalive messages: 5
Time (s) to declare Keepalive message lost: 5
Number of missed messages to declare alarm: 5
Number of received messages to clear alarm: 10
Interpret error code as success: 0
So, it sends a ping every 5 seconds. After 5 failed pings it cuts over to the backup server. After 10 successful pings it switches back to the primary. Good point, also, about the voicemail being lost in a disk crash.
Chris, I haven't looked at realtime before now, but on the surface it looks like it would be hard to implement with MTE. My 4500s are on the customer premises instead of in front of the Asterisk box.
Thanks for the comments, guys.
Dan
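The alarm thresholds quoted above (5 missed messages to declare, 10 received to clear) could be mirrored on the backup box with a small state machine like the sketch below. It assumes the counts are consecutive, which the settings do not actually state, and the class name `KeepaliveAlarm` is made up for illustration.

```python
class KeepaliveAlarm:
    """Track keepalive results and decide when to fail over or back.

    Assumption: misses and successes are counted consecutively; any
    result that agrees with the current state resets the streak.
    """

    def __init__(self, missed_to_alarm=5, received_to_clear=10):
        self.missed_to_alarm = missed_to_alarm
        self.received_to_clear = received_to_clear
        self.in_alarm = False
        self._streak = 0  # consecutive results contradicting current state

    def record(self, ok):
        """Feed one keepalive result; return True while in alarm."""
        contradicts = ok if self.in_alarm else not ok
        if not contradicts:
            self._streak = 0
            return self.in_alarm
        self._streak += 1
        needed = self.received_to_clear if self.in_alarm else self.missed_to_alarm
        if self._streak >= needed:
            self.in_alarm = not self.in_alarm
            self._streak = 0
        return self.in_alarm


# Hypothetical usage, calling record() once per 5-second keepalive interval:
# alarm = KeepaliveAlarm()
# if alarm.record(primary_answered):
#     ...  # enable trunk registrations on the backup
```

With a 5-second interval, five consecutive misses put the cutover somewhere around 25 to 30 seconds after the primary stops answering, depending on how the 5-second loss timeout is counted.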
'"User not found" message is
'"User not found" message is most likely related to a missing webmin user. When user extensions are created, webmin users are created as well (that is what allows them to login and access the user portal).
By default Webmin user information is stored in /etc/webmin and /etc/webmin/asterisk directories.'
This brings up a good point.
I wonder if the webmin users are restored when you do a restore. I don't think so because the directories mentioned above are not part of the tar.
-Matt
Webmin users don't seem to
Webmin users don't seem to be backed up, Matt. I moved a customer over to my backup server last week because of a problem. The only two problems with the move were that I had to manually move my inbound trunks and the webmin users didn't move.
You need to go into webmin
You need to go into webmin and have webmin back up its webmin modules to restore the webmin side of things. You can also use this backup to grab other files (like your firewall script and provisioning directory).
You should also have your MySQL database doing nightly dumps from webmin.
This way, in the event of catastrophic failure you can restore all webmin users, provisioning files, Asterisk files, and databases (like CDR).
Webmin users
I looked at the code - webmin users are backed up. It was not the case in the earlier versions of PBX Manager - which version are you running?
The relevant webmin directories are /etc/webmin and /etc/webmin/asterisk. Please check if they are in the archive.
Best regards,
Alex
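One quick way to run the check Alex suggests, sketched under the assumption that the backup is a regular tar archive (the thread refers to "the tar"). The function name `dirs_in_archive` and the archive path in the usage comment are hypothetical.

```python
import tarfile


def dirs_in_archive(archive, wanted=("etc/webmin", "etc/webmin/asterisk")):
    """Report which of the `wanted` directories appear in a tar archive."""
    with tarfile.open(archive) as tar:
        names = []
        for name in tar.getnames():
            # Normalize './etc/webmin' and '/etc/webmin' to 'etc/webmin'.
            if name.startswith("./"):
                name = name[2:]
            names.append(name.strip("/"))
    return {
        w: any(n == w or n.startswith(w + "/") for n in names)
        for w in wanted
    }


# Hypothetical usage; the archive path is a placeholder:
# print(dirs_in_archive("/backups/production.tar.gz"))
```

A `False` for either directory would suggest that the webmin users are not travelling with the backup.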
I'll check that. I'm
I'll check that. I'm running 6.0.1.36. If I schedule a backup in Thirdlane does it pick up all tenants? I may be doing my backups manually because I wasn't sure if it was picking up everything. And, in doing so I may have left out webmin.
eek! that's a beta version of
eek! that's a beta version of PBX Manager, you might want to upgrade to the current version lest ye run into a lot of bugs the rest of us will have never seen and can't reproduce :)
backing up everything
I know this might sound like a stupid question, but why are you not backing up the whole box? I mean everything!!
Using the TL backup is fine, BUT it's not going to get you everything you need in case of a major crash. There are many free software packages out there to get the whole system, not just pieces.
George
Hi Dozment,
I am familiar with what you are trying to accomplish. I have not done this with MTE, but prior to MTE I was using 2 Asterisk servers, 1 active and 1 passive. Both servers were using the same MySQL database and I was using realtime for sip, extensions, voicemail, etc. In front I used a Ranch Networks SBC, which monitored the servers and failed over automatically in the event the active server went offline. I am not sure how to accomplish this with MTE as it is writing directly to the conf files. Also, the 4500 may be a bit limited for a hosted PBX; you may want to look at the EdgeProtect series. I have used the 4500 on the CPE side and now their 200 series.
Cheers,
Chris A