Sometimes the best answer to a problem is to hit the reset button, but it should probably be the last answer, not the first.
My cohort Pete Silva attended the 2009 Cloud Computing and Virtualization Conference & Expo and offered up a summary of one of the sessions he enjoyed (‘Cloud Security - It's Nothing New; It Changes Everything!’ (pdf)) in a recent post, “Virtualization is Real”:
One of the sessions I enjoyed was ‘Cloud Security - It's Nothing New; It Changes Everything!’ (pdf) from Glenn Brunette, a Distinguished Engineer and Chief Security Architect at Sun Microsystems.
…
Scale
– Today Security administrators deal with 10’s, 100’s, even 1000’s of
servers but what happens when potentially tens of thousands of VM’s get
spun up and they are not the same as they were an hour ago. Security
assessments like Tripwire, while work, inject load and what if those
servers are only up for 30 minutes? How can you be sure what was up
and offering content was secure? One
idea he offered was to have servers only live for 30 minutes then drop
it and replace. If someone did compromise the unit, they’d only have a
few moments to do anything and then it’s wiped. You
can keep the logs but just replace the instance. Or, use an Open
Source equivalent every other time you load, so crooks can’t get a good
feel for baseline system.
The “scale” we’re
talking about is a combination of scaling processes and systems. We
don’t often talk about the impact of large-scale environments on
processes but security processes are almost always the hardest hit as
an environment grows because of the sheer volume of data and systems
involved. That said, Glenn’s idea to only allow servers to “live” for
30 minutes is an interesting one, and I am going back and forth between
“that’s a good idea” and “that’s a bad idea” and “there’s got to be a
better way.”
THE GOOD
One
of the reasons this is a good idea is because virtualization provides a
snap-shot in time, a known state, a known security posture for the
applications deployed within the virtual container. By releasing it and
launching it anew, you are assured of the security of the application
and environment because you essentially go back to the beginning.
Any changes to the system since the last “launch” are effectively wiped
out (logging to an external storage system would be a requirement, of
course) and any back-doors, trojans, malware, or rootkits dropped onto
the system would be gone.
That would frustrate the heck out of an attacker, wouldn’t it?
But
it would also likely frustrate the heck out of end-users who might have
been using the application at the time it was released.
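To make the “drop it and replace” idea concrete, here is a minimal sketch of the bookkeeping involved. It is not Glenn’s implementation; the Instance class, the function names, and the in-memory fleet are all hypothetical, and a real setup would call whatever API your virtualization platform or provider exposes.

```python
from dataclasses import dataclass

MAX_LIFETIME_SECONDS = 30 * 60  # the 30-minute lifetime from Glenn's suggestion


@dataclass
class Instance:
    name: str            # hypothetical identifier for the virtual instance
    launched_at: float   # launch time, in seconds on some shared clock


def instances_to_recycle(instances, now, max_lifetime=MAX_LIFETIME_SECONDS):
    """Return the instances whose age has reached the maximum lifetime."""
    return [i for i in instances if now - i.launched_at >= max_lifetime]


def recycle(instance, now):
    """Model 'drop and replace': a fresh instance from the known-good image.

    Nothing is carried over from the old instance, which is the point.
    Logs are assumed to have been shipped to external storage already.
    """
    return Instance(name=instance.name, launched_at=now)


fleet = [Instance("web-1", 0.0), Instance("web-2", 1000.0)]
expired = instances_to_recycle(fleet, now=1800.0)
```

Notice that nothing in this sketch knows or cares whether anyone is connected to the instance being released, and that is exactly the problem.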
THE BAD
There
are a couple reasons this is just a bad idea, and the impact on
availability to end-users is just the most obvious one. In a live
environment it’s never a good idea to just “bring down” an instance of
an application – virtual or traditional – that users might be
accessing. Doing so severs their connections and wipes out any session
state that might have been stored on the server and forces them to
“start again”. That said, if you knew this was part of your
security strategy you could ensure that developers understood this
behavior so that they implemented a database-based shared-session model
for the applications. If session data is stored in a shared database –
on a separate instance – then the potential damage to user sessions is
mitigated because it does not rely on any given application instance.
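A rough sketch of what a database-based shared-session model looks like follows. The SharedSessionStore class and the table layout are hypothetical, and an in-memory SQLite database stands in for the separate database instance you would actually want.

```python
import json
import sqlite3


class SharedSessionStore:
    """Session state kept in a database rather than in application memory."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, data TEXT)"
        )

    def save(self, session_id, data):
        # The session payload is stored as JSON text, keyed by session id.
        self.conn.execute(
            "INSERT OR REPLACE INTO sessions (id, data) VALUES (?, ?)",
            (session_id, json.dumps(data)),
        )
        self.conn.commit()

    def load(self, session_id):
        row = self.conn.execute(
            "SELECT data FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


# Stand-in for the shared database; assumed to live on a separate instance.
db = sqlite3.connect(":memory:")

instance_a = SharedSessionStore(db)   # application instance about to be released
instance_a.save("user-42", {"cart": ["book"]})
del instance_a                        # the instance is released

instance_b = SharedSessionStore(db)   # its replacement
restored = instance_b.load("user-42")
```

The replacement instance picks up the session because the state never lived on the instance that was released.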
Assuming
this is the case, you then have to be concerned about the loss of the
connection to the application for users. Again, if you knew this was
going to be one of your security techniques then you’d best let the
network or application delivery network
folks know ahead of time as they can ensure that users are seamlessly
redirected to new (or other existing) instances as soon as the one they
were connected to is released. Basically you’d have to ensure you had a
load balancing solution in place to ensure reliability of access to the application.
This also means you should probably always have at least two instances of the application available, rotating through this up-down-up-down schedule on staggered time intervals so that they are never both down at once.
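Whether two instances are enough depends on how their schedules line up. The following back-of-the-envelope sketch assumes a 30-minute lifetime and a 2-minute relaunch time (the relaunch figure is purely an assumption for illustration) and checks how many instances are up at any given minute.

```python
LIFETIME_MIN = 30   # each instance is released after this many minutes
RESTART_MIN = 2     # assumed time to relaunch from the image


def is_up(minute, offset):
    """True if an instance whose cycle starts at `offset` is serving at `minute`.

    Each cycle is LIFETIME_MIN minutes of service followed by RESTART_MIN
    minutes of relaunch.
    """
    cycle = LIFETIME_MIN + RESTART_MIN
    return (minute - offset) % cycle < LIFETIME_MIN


def min_available(offsets, horizon_min):
    """Smallest number of instances up at any minute in the horizon."""
    return min(sum(is_up(m, o) for o in offsets) for m in range(horizon_min))


in_sync = min_available([0, 0], horizon_min=240)     # both recycle together
staggered = min_available([0, 16], horizon_min=240)  # recycles half a cycle apart
```

With identical schedules there are minutes with no instance available at all; offsetting the second instance by half a cycle keeps at least one up throughout.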
Overall
you’re likely to incur higher costs with this kind of a strategy as
well. It is typical for providers to charge “by the hour” and any
partial hour is counted as a full hour. Rotating server/application
instances every half-hour would likely incur charges for two instances
per hour instead of one anyway.
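The arithmetic is simple enough to sketch. This assumes the billing model described above (any partial hour counts as a full hour) and further assumes that every replacement instance starts its own billing clock; check your provider’s actual terms, as the function names here are just for illustration.

```python
import math


def billed_hours(runtime_minutes):
    """Hours charged for one instance when any partial hour counts as a full hour."""
    return math.ceil(runtime_minutes / 60)


def billed_hours_with_rotation(total_minutes, lifetime_minutes):
    """Hours charged to keep one slot filled by back-to-back short-lived instances."""
    full_runs, remainder = divmod(total_minutes, lifetime_minutes)
    hours = full_runs * billed_hours(lifetime_minutes)
    if remainder:
        hours += billed_hours(remainder)
    return hours


one_day_steady = billed_hours(24 * 60)                     # one long-lived instance
one_day_rotating = billed_hours_with_rotation(24 * 60, 30)  # recycled every 30 minutes
```

Under those assumptions a day of 30-minute rotations bills 48 instance-hours against 24 for a single long-lived instance, before you even add the second instance needed for availability.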
THE UGLY
This
strategy also does very little to address the most pressing security
threat facing applications today: tainted user data. That’s going to
hit the database, and unfortunately Glenn’s “go back to the beginning”
approach to security would be disastrous when applied to virtual
environments in which a database is running. You want databases to change, to grow, to be modified. It is in their nature to store data and change over time.
So
you can’t use this concept for a virtualized environment in which a
database is deployed. It would be detrimental to the health of the
business.
But there’s something to Glenn’s idea
that’s certainly appealing when part of a broader security strategy.
What his “up-down-up” technique is designed to prevent is compromise of
the system, i.e. trojans, worms, viruses, and malware inserted
into the system that can be used for illegitimate access or as part of
a larger botnet. His technique certainly addresses those security risks
by effectively wiping them out on a regular basis. What’s not accounted
for is the injection of malicious code into the database, which cannot be so easily “reset.”
Perhaps this is a job for Infrastructure 2.0?
INFRASTRUCTURE 2.0 IS MORE THAN JUST NETWORK STUFF
If
we employ the use of an infrastructure 2.0 capable application delivery
network we can utilize Glenn’s technique in conjunction with other
security technology to provide better coverage in a more dynamic way.
Consider that the integrated network and application network security
capabilities of the application delivery network can protect application instances against web application attacks, especially those that are really targeting the database, e.g. SQL injection.
Also
consider that an application delivery solution can provide the failover
capabilities required to assure availability in an environment in which
instances may be going down and coming up in a highly volatile pattern.
That addresses the “bad” and the “ugly” impact on end-users
resulting from Glenn’s “up-down-up” technique, leaving us only with the
“good”.
But it really doesn’t address the root of the
problem, the reason Glenn suggests going back to the beginning in the
first place: volatility and change. Scaling security processes across
thousands of virtual instances is problematic, I agree, but one of the
reasons it’s so hard to scale is that you don’t know what’s going on.
There’s currently no real collaboration across the entire
infrastructure. Security folks can’t get a good feel for what’s going
on in a large scale, dynamic environment because the information they
need to correlate and assess the current security posture of the
environment and applications is dispersed across the infrastructure.
What’s
needed is an overarching system that can integrate security solutions
with the rest of the infrastructure. When a virtual environment is
brought online the security infrastructure needs to know about it – not
just to apply the proper policies but also to assess its current
posture and ensure it is added to the pool of resources that needs to
participate in the larger security scheme. If a HIPS (Host Intrusion
Prevention System) is used to monitor a system for intrusion and its
alarm is triggered, that information
needs to be imparted to the rest of
the infrastructure. If a virtual machine is potentially compromised it
should be immediately removed from the available pool of resources.
That requires collaboration across the entire infrastructure. If part
of the launch process includes a vulnerability scan of the application
and that scan comes back positive, perhaps the instance should not be
allowed to launch, and the infrastructure should be notified immediately so that
it can take whatever steps are necessary. That might mean automatically applying a virtual patch for the vulnerability, if possible, and allowing the instance to launch while notifying security and developers that there’s a vulnerability in need of patching.
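The kind of collaboration described here can be sketched as a few event handlers. Everything below is hypothetical: the ResourcePool class, its method names, and the notification list are stand-ins for whatever your load balancer, scanner, and HIPS actually expose, not the API of any real product.

```python
class ResourcePool:
    """Hypothetical pool of instances that a load balancer sends traffic to."""

    def __init__(self):
        self.available = set()
        self.notifications = []

    def notify(self, message):
        # Stand-in for alerting security staff and developers.
        self.notifications.append(message)

    def on_launch(self, instance, scan_findings, can_virtually_patch=False):
        """Admit an instance only if its launch scan is clean or mitigated."""
        if scan_findings and not can_virtually_patch:
            self.notify(f"{instance}: launch blocked, findings: {scan_findings}")
            return False
        if scan_findings:
            self.notify(
                f"{instance}: launched with virtual patch, findings: {scan_findings}"
            )
        self.available.add(instance)
        return True

    def on_hips_alarm(self, instance):
        """Remove a potentially compromised instance from rotation immediately."""
        self.available.discard(instance)
        self.notify(f"{instance}: removed from pool after HIPS alarm")


pool = ResourcePool()
pool.on_launch("vm-1", scan_findings=[])
pool.on_launch("vm-2", scan_findings=["sql injection"])
pool.on_launch("vm-3", scan_findings=["sql injection"], can_virtually_patch=True)
pool.on_hips_alarm("vm-1")
```

The logic itself is trivial. The hard part is that the scanner, the HIPS, and the pool are usually separate products that do not talk to each other.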
Cloud
computing and virtualization are going to force integration and
collaboration to the fore of architecture design out of necessity. The
scale of systems using virtualization is growing and becoming less and
less manually manageable, which will inevitably result in more automation and orchestration at the infrastructure layer.
Let’s
not forget that the myriad pieces of security software that provide valuable
information and threat mitigation are also part of the “infrastructure
2.0” family, as it were. We need to start thinking more broadly, more
strategically about how to leverage collaboration across
the disparate functional silos within IT to come up with better
solutions to address security and its associated scaling challenges in
a cloud computing environment.