Aug 12 2008

Server Monitoring - Take Care Of Server Intrusion

Posted by Mike Brunt at 8:29 AM
18 comments
- Categories: ColdFusion | JRun-J2EE

I have had several clients recently who have had issues which appear to be related to inopportune use of server monitoring. We were all delighted when Adobe announced their server monitoring capabilities for ColdFusion 8, yet it became apparent that leaving "memory tracking" on in production could cause significant issues; I use this as an example of inopportune use of server monitoring. Recently I had a client whose servers would suddenly go off line once they reached 10 threads in use, even though there appeared to be no great state of distress overall and free memory was fine. This was not related to the server monitor in ColdFusion 8 but to a crash protection capability in another product.

The main point of my post here is that, in my opinion and experience with many, many clients, server monitors should do nothing more than just that: monitor the server. There is no magic bullet in any product that I have ever seen that fixes underlying issues. In fact, the more data a product offers, the more obscured the actual cause of problems often becomes. If we are allowing something to shut-down our servers that is not a solution but a band-aid and I personally would not be trusting something to shut-down my production instances.

The perfect product for me would give me the ability to watch request times, threading and memory use, along with the SQL passed through the JDBC driver, in real time. It would also allow the setting of warning thresholds to notify me of high thread or low memory situations, and have the capability to save this information, along with slow request information and query string details, to a log and a database. Lastly, it would give me a plain, understandable translation of the meaning of critical stack traces.

That is it for me, that is enough to see and analyze any problems I have ever encountered for all the clients I have ever helped.
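
To make the idea of warning thresholds concrete, here is a minimal Java sketch of a check that only reports and never acts on the server. It is an illustration, not how any of the products discussed here work: the class name, the threshold values, and the message format are all hypothetical, and it counts live JVM threads through the standard java.lang.management API rather than ColdFusion request threads.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class ThresholdMonitor {
    // Hypothetical limits; real values depend on the instance being watched.
    static final int MAX_THREADS = 200;
    static final double MIN_FREE_MEMORY_RATIO = 0.10;

    /**
     * Compares one sample against the thresholds and returns warning messages.
     * It only describes what it sees; it never kills, queues, or restarts anything.
     */
    public static List<String> evaluate(int liveThreads, long freeBytes, long maxBytes,
                                        int maxThreads, double minFreeRatio) {
        List<String> warnings = new ArrayList<>();
        if (liveThreads > maxThreads) {
            warnings.add("HIGH_THREADS: " + liveThreads
                    + " live threads (limit " + maxThreads + ")");
        }
        double freeRatio = (double) freeBytes / maxBytes;
        if (freeRatio < minFreeRatio) {
            warnings.add("LOW_MEMORY: " + Math.round(freeRatio * 100)
                    + "% of the heap still available (minimum "
                    + Math.round(minFreeRatio * 100) + "%)");
        }
        return warnings;
    }

    /** Takes one sample from the running JVM using standard JDK calls. */
    public static List<String> sample() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        long available = rt.maxMemory() - used; // heap the JVM could still hand out
        return evaluate(threads.getThreadCount(), available, rt.maxMemory(),
                MAX_THREADS, MIN_FREE_MEMORY_RATIO);
    }

    public static void main(String[] args) {
        // A real monitor would sample on a schedule and write to a log or
        // database; this sketch prints a single sample to standard error.
        for (String warning : sample()) {
            System.err.println("WARN " + warning);
        }
    }
}
```

The comparison is kept separate from the sampling so the thresholds can be exercised with made-up numbers, and the only thing the code ever does with a breach is describe it.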

Comments

charlie arehart wrote on 08/12/08 5:13 PM

Mike, rather than leave people to guess, can you detail what you mean about the problem of "crash protection capability in another product" causing a server to "suddenly go off line once they reached 10 threads in use"?

Since you're not talking about CF8's alerts (a feature they have akin to crash protection), and I don't think SeeFusion offers something akin to it, I suspect you're referring to FusionReactor.

But I can't think what CP feature could cause the server to go offline once 10 threads are used. It could send a notice, or it could abort threads, or it could queue future threads while the problem remained. I suppose if someone did the latter and didn't expect it, the queuing of requests could seem like "going offline". But hey, powerful tools come with responsibility. Let's not blame the tools for being misused.

Now, the "memory tracking" of CF8's monitor might be in a different class: it really can kill a server under even moderate load--but then judicious use of oft-missed filtering features could temper that.

But you conclude "If we are allowing something to shut-down our servers that is not a solution but a band-aid and I personally would not be trusting something to shut-down my production instances." Again, what are you referring to that's "shutting down" the server?

I wonder if instead you're referring to FusionReactor's Enterprise Scripting feature, which allows a monitoring server to control a monitored server if it detects it becoming unresponsive. [Some may say, "but, Charlie, you sound defensive, jumping to the conclusion that it's FR". I'm just using my familiarity with the features of the 3 tools to sort through what wasn't said. :-) ]

In that case, yes, it can shut down a server that it deems has become unresponsive (more at http://www.fusion-reactor.com/fr/featurefocus/monitoredServerScripts.cfm). Again, though, it's just a tool. Many have long used home-grown or third party features like this (ping tools, which will restart the CF service if it fails to respond.) As with the CP abort/queue features, all such tools do need to be used carefully.

Let me be clear: I, too, tend to stay away from the features that either abort or queue requests, or restart the server, at least when first using a tool. I'd much rather use them the way you describe, to gather diagnostic info. And indeed, the main points of your perfect tool are what all three of the major tools do.

I just fear that it's possible someone reading your note could conclude that the "other unnamed" monitor tool is something to fear and avoid. That would be an unfortunate conclusion. All 3 of the major CF monitoring tools have their strengths and weaknesses, and all can be configured to have very low overhead and impact, just serving to provide info to help make better diagnosis and analysis.

But you go on to say that such tools "should do nothing more than just monitor the server". Well, that's certainly a philosophical difference among the tools. Even CF8's monitor can do more than just monitor, in its alerts features. People should be careful with them, yes, but it seems a bit much to assert that the tools shouldn't offer the feature at all.

Still I agree that it's good and fair to warn people (especially newcomers to a tool) to be careful of the more powerful features, and for that I'm glad to see you offer your thoughts here. Always thought-provoking! :-)

Mark Kruger wrote on 08/12/08 6:47 PM

Mike,

I agree with you pretty wholeheartedly that server monitors should monitor and not be tasked with killing threads or recovering resources automatically.

This sort of reminds me of the many uses of try/catch. It can be an appropriate tool for catching items outside of our control - or it can be a catchall for sloppy code :) In the same way, I fear that sometimes the automated alerts and actions that mitigate problems are really workarounds that mask deeper problems.

Of course if you are someone whose business is shared hosting (like the folks at Edge web for example) then sometimes the condition of the code is out of your control as well - so I guess I can see some circumstances where it would be a godsend. Still, overall I think you are spot on.

-Mark

Mike Brunt wrote on 08/12/08 7:15 PM

@Charlie, thanks for your comprehensive response. The whole subject of passing JVM arguments and tuning is incredibly dense, as you know. Here is a list showing most of those: http://blogs.sun.com/watt/resource/jvm-options-list.html

The basis and ethos of SeeFusion was to avoid putting any dangerous possibilities into the tool so that it could be used in any way safely in all places. There is literally nothing you can do with SeeFusion that could endanger a CF server in any way.

All I need to do my job is a basic server monitor that will give me what I need and can in no way interfere with the uptime of a server.

Mike Brunt wrote on 08/12/08 7:20 PM

@Mark thank you for your comments. I realized when I was creating this blog post that it might be controversial, so I purposely avoided naming any product. This was prompted by one of my clients having server instances shut down when, in my opinion, they would have recovered from the high thread count.

Mike Brunt wrote on 08/13/08 3:59 AM

@Charlie, when I read my response from last night again, it was a bit inadequate considering your comprehensive one. When I pointed out the number of different arguments that can be passed to the JVM, I was trying to illustrate what a complex matter managing the JVM can be. To go to your point regarding FR, I believe it was the thread queuing which was causing an instance to be taken off-line, and I think that is because the clustering device probably determined that the instance was unresponsive at that time. This was causing the client to lose instances from their cluster. They turned off crash protection and the instances almost always recovered from those high thread peaks. After years of working on instance-server tuning I am simply uncomfortable with something which is, in my opinion, getting too enmeshed in the server operation. As I state in my article here, I only need to monitor a server and I don't want to use anything in production that can cause an instance to go off-line. I am just very, very conservative in that respect and in most cases still just use enhanced logging to analyze and fix problems. These are my professional opinions culled from 8-9 years of analyzing and solving ColdFusion - JRun issues.

Mike Brunt wrote on 08/13/08 2:18 PM

@Charlie, one more item I did not fully address was my use of the term "band-aid". This was not directed at any product; what I was trying to say is that hiding underlying application problems by doing things such as re-starting CF on a regular basis is not a solution, and I know you fully realize that. In addition, the default behavior of CF-JRun is to queue threads when the simultaneous request ceiling has been reached. What does FusionReactor do in addition to this default behavior with threads?

Sharon wrote on 08/15/08 5:23 PM

Do you have a ColdFusion user group meeting? When and where? If not, who would like to have a monthly meeting?

charlie arehart wrote on 08/16/08 2:11 PM

Hey Mike, thanks for getting back to me. For some reason I wasn't getting all the comment emails for this thread (I got some, just not all).

As for whether and when to use FR's queuing, this is discussed at length in a back and forth exchange on the FR google group, in this thread:

http://groups.google.com/group/fusionreactor/browse_thread/thread/2f57342847400193/

The bottom line may be that CF started exposing its queuing features after FR (which has offered it for 6, 7, and 8). There is also a new feature in FR3 to monitor queued requests, something I don't see being offered in the CF8 monitor (which even if it does offer it, is only on CF8 Enterprise).

But as I said above, each tool has its strengths. I get why SF doesn't even offer the power to control requests. As I noted, even though FR and the CF 8 monitor both offer it, they are tools that must be used carefully--if at all. I tend to be in the latter camp, too. Just didn't want anyone to think that just because FR had such power, its use otherwise should be discouraged or impugned. :-)

Hey, one last thing: any chance you may tone down the intensity of the Captcha you use? It's pretty tough to read. Many have gone to less obtrusive captchas (fewer lines, fewer circles), and not seen an increase in spam. Just a suggestion. If that's using Lyla as the captcha, I blogged about simple changes in its XML file that you could consider:

http://carehart.org/blog/client/index.cfm/2006/10/7/lyla_captcha_simplified_xml_file

Your call, of course. Just trying to help if you'd not considered this.

Mike Brunt wrote on 08/18/08 2:01 PM

@Charlie, thanks as always for taking the time to post comments here and I am sorry about the CAPTCHA thingie. Even with this fairly tough to use version I still get the occasional spam comment. I fully understand your point about taking care when using tools and I also can appreciate the possible benefits of expanded facilities in a server monitoring tool. To me, providing ways to use information from the JVM to adjust server behavior could be somewhat dangerous, particularly when it is wrapped in such an impressive GUI as FusionReactor's undoubtedly is. I hope that makes sense. I also think it would be good for you to put links here to any classes you give on using FusionReactor to help others use it well.

Mike Brunt wrote on 08/18/08 2:03 PM

@Sharon, do you mean in the Los Angeles area?

charlie arehart wrote on 08/18/08 2:42 PM

Hey Mike, a couple of thoughts. First, about monitors and their features to control requests, you say "providing ways to use information from the JVM to adjust server behavior could be somewhat dangerous". I realize it may seem we were coming to agreement on the above, but I just feel I need to respond to this. And while I'm at it, a couple other thoughts have come to mind. All this is hoping to help folks reading the thread, not trying to pick any fight, of course. I know you know that. :-)

I just find curious the assertion that FR is "getting information from the jvm". It's tracking requests as they come and go, like SeeFusion, just watching how long they take to run, how many are running, and available memory. It then also offers options to queue or terminate requests if set limits for those are exceeded. (I guess you could say the memory stat comes from the JVM, but not the rest. Just not sure what the key point may have been there, so I'm kind of guessing.)

Now, the crux of this entry and the comments has been about tools giving users so much control that they can shoot themselves in the foot and do more harm than good, right? Well, I didn't press the point before, but SeeFusion supports killing requests manually too, just like FR and the CF8 monitor. And that can certainly be misused or abused. Just being clear about things.

Maybe your concern is with such control being automated, and I have agreed that that requires still more care.

But then I'd ask: do you have the same concern about load balancing/failover and other monitoring solutions, which ping the server and often also offer options to restart a non-responsive service (or might instead just declare it out of the cluster, even when it may still be up)? Just seems worth getting that out there, as again we don't want to fault any one tool as being too onerous in its efforts to provide for keeping things running smoothly. It's easy with any such tool for it to be more trouble than value if it's not used carefully.

As for the training classes, thanks for asking. If anyone's interested, they're offered at http://www.fusion-reactor.com/fr/traininginformation.cfm. (I wonder if anyone may read your question as really being a surreptitious way to get me to admit I have close alliances to FR. I don't think that's what you meant, but I don't hide it. I speak about FR as much as about the CF8 monitor, and in the past also spoke about SeeFusion. Again, I see them all bringing value to the table, so my comments here in support of FR are just from one member of the community to the others.)

Your blogs (this and the old ones at Alagad, Teratech, and webapper) are among the few that discuss a lot of these server issues, so people are thirsty for info on the topics. I just want to add to the conversation and help keep people fully informed. We're both here to help. :-)

Finally, about the captcha, the sad truth is that some of the spam comments will come from humans simply trying to increase their search engine ranking by dropping a URL into a comment (or in the website field here, even if not in the comment). We as bloggers just need to eradicate those pests when they pop up. But really, then, the heavy captcha is hurting far more people than it's helping, as clearly the spammers work through it regardless. Only if you were being hit by automated spammers would a weaker captcha start to hurt. Just a thought.

Cheers.

Mike Brunt wrote on 08/19/08 6:35 PM

@Charlie thank you once again for commenting so comprehensively, hopefully these comments will help all who read them.

In essence, and in my opinion, we are both correct. I believe that server monitors should provide only what is necessary to troubleshoot problems and nothing more. My beliefs come from 8-9 years of traveling the USA and overseas, troubleshooting and fixing issues on-site with clients. There is no doubt that it would be marvelous to take what I have learned and put the methodologies I use to fix issues into a tool for others to use to achieve the same results. In fact, were that possible, or if I believed it were safely possible, they would be in SeeFusion, as I co-owned Webapper when we launched SeeFusion. In addition, the senior engineer on the SeeFusion team, Daryl Banttari, spent more time than I did troubleshooting and fixing ColdFusion issues and he would love to be far richer, I am sure. However, we both believed that loading up SeeFusion with many more options was not a good idea, as it takes a server engineer to safely make many of those decisions. Our opinions and the product development that went into SeeFusion were drawn from well over 19,000 hours of working with over 100 clients on-site solving issues; it is important to state that. Your point about the ability to kill threads using SeeFusion is a fair one. This option in SeeFusion is very conservative. In fact my experience is that it only works around 30% of the time because SeeFusion is engineered to allow threads to be killed ONLY when the overall stability will be affected as a result.

The blog piece that I created here was well considered before I released it; in fact, I purposely avoided mentioning FusionReactor at all. However, what I was blogging about was something I observed whilst helping a client.

Finally, I no longer co-own Webapper nor do I have any financial interest of any kind in SeeFusion. It is simply so often invaluable to my clients in identifying the real causes of application performance and stability issues.

Mike Brunt wrote on 08/19/08 6:38 PM

What an unfortunate typo on my part, the final sentence in the last paragraph above should read "In fact my experience is that it only works around 30% of the time because SeeFusion is engineered to allow threads to be killed ONLY when the overall stability will NOT be affected as a result."