SOC Mistake #8: You Don’t Speak the Language of Business, You Speak the Language of Security

This is by far one of the most common failings of Security Operations.  I’ve reviewed the maturity of several large global Security Operations Centres and they appear to be doing a reasonable job of the prediction, detection and investigation of information security incidents – but none of this is visible to the rest of the organisation that funds their operational budgets.

It is common to find someone who has started life as an operational information security person, maybe originally a firewall or Intrusion Detection System administrator, whose career ultimately takes them to SOC Manager.  Their life has been steeped in the operational reports produced by technical controls such as firewalls, Intrusion Detection Systems and anti-virus solutions.  These reports are meaningful to them, although I’d argue about the contextual value you can gain out of an individual control’s report, but count-based metrics such as ‘37 unauthorised access attempts across the business’ or ‘300,000 blocked spam emails’ are pretty meaningless to senior management.

I’m reminded of Monty Python’s Spanish Inquisition Scene set in Jarrow in 1911 where Graham Chapman enters and says to the mill owner: “One on’t cross beams gone owt askew on treddle” to which the mill owner, unaccustomed to both the regional dialect and the technical jargon, says “Pardon?”.

Graham Chapman’s character is looking to the mill owner for support and direction, but he’s presenting the problem in the operational language he understands.  If he’d said “A vital piece of the manufacturing equipment in our rail sleeper production has become mis-aligned halting production” instead of “I didn’t expect some kind of Spanish Inquisition” the problem would have got sorted and Cardinals Ximinez, Fang and Biggles would have never appeared.

It is the job of the Security Operations Centre, just like the rest of the information security function, to present meaningful decision support management information around information risk to the management and it’s the management’s responsibility to make decisions on risk based on it.  If we never provide information in a language or format they can utilise, we’re always going to be seen as these strange people who live in the basement and occasionally come into the boardroom and start speaking something that sounds like Klingon to the C-level execs.

The other issue is that information security risk isn’t the only risk that companies need to consider, even if we do try to treat ourselves like a special little snowflake.  An organisation’s risk function has to balance liquidity risk; currency risk; supply chain risk; asset risk; competition risk; pricing risk; and capital availability, to name but a few.  We’re often so wrapped up in our own little worlds that are so important to what we do, and we vent when decisions don’t go “our way”, forgetting that the C-level suite are running a company whose main business probably isn’t information security.

A typical SOC

A classic example I can illustrate happened to me.  I came into a budget meeting armed with a risk assessment and a budgeted control suite – risk to the business was around 800K, the cost of controls was about 200K and residual risk would have been around 200K.  Job done, the average quantitative risk wonk would say (it was a much more in-depth risk analysis than I am demonstrating here for the sake of brevity), but the issue was that if that 200K was invested in three more sales people they’d bring in much more than 800K in revenue to the organisation, which at the time was exhibiting a 98% customer retention rate and whose future growth and operational costs were funded out of customer subscriptions.  When you took into account the balance of information security risk vs. opportunity risk, my project was a bad call.
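The arithmetic behind that decision can be sketched roughly as follows. This is only an illustration using the rounded figures quoted above: the function name and the simple subtraction model are assumptions made for the example, not the in-depth analysis itself, and the sales figure is treated as a lower bound because all we know is that it was "much more than 800K".

```python
def net_benefit_of_controls(inherent_risk, residual_risk, control_cost):
    """Risk reduction bought by the controls, minus what the controls cost."""
    risk_reduction = inherent_risk - residual_risk
    return risk_reduction - control_cost


# Rounded figures from the budget meeting, in thousands.
INHERENT_RISK = 800
RESIDUAL_RISK = 200
CONTROL_COST = 200

security_case = net_benefit_of_controls(INHERENT_RISK, RESIDUAL_RISK, CONTROL_COST)

# Assumption: the same 200K spent on sales people returns at least 800K,
# so this is a lower bound on the competing investment.
SALES_REVENUE_LOWER_BOUND = 800
sales_case = SALES_REVENUE_LOWER_BOUND - CONTROL_COST

print(f"Security controls net benefit: {security_case}K")
print(f"Sales hires net benefit (lower bound): {sales_case}K")
```

Even on the most conservative reading of the sales number, the competing investment comes out ahead, which is the comparison the board was implicitly making.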

So the presentation of risk in a language the business understands and that allows a normalised comparison with other forms of risk, if you operate an Enterprise Risk Management framework, is one of the key success criteria for good security operations.

So what does good management information look like?  Well, financial metrics are a good start.  Everyone in the C-suite understands pounds, shillings and pence (excuse my pre-decimal example, dollars and cents to my US friends).  Creating financial metrics has long been a difficult proposition, but there are several ways to do it.  Myself, I tend to map my SIEM event categories onto the VERIS framework.  This then lets me use the average costs and time-to-resolve metrics from the Verizon Data Breach Investigations Report, which I still consider to be one of the best yardsticks of what is going on in the wider world, to show my organisation’s performance against the average.
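As a rough sketch of the mapping idea, the snippet below groups incidents by VERIS action category and sets the organisation's average cost beside a benchmark. The SIEM category names, the incident record layout and every benchmark number are hypothetical placeholders invented for illustration; real figures would have to come from your own SIEM and from the published report.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical SIEM category names mapped to VERIS action categories.
SIEM_TO_VERIS = {
    "virus_detected": "malware",
    "brute_force_login": "hacking",
    "phishing_email": "social",
    "policy_violation": "misuse",
    "misconfiguration": "error",
}

# Placeholder benchmark costs, NOT taken from any published report.
BENCHMARK_AVG_COST = {"malware": 10_000, "hacking": 25_000, "social": 8_000}


def compare_to_benchmark(incidents):
    """Group incidents by VERIS action and compare average cost to the benchmark."""
    by_action = defaultdict(list)
    for incident in incidents:
        action = SIEM_TO_VERIS.get(incident["siem_category"], "unknown")
        by_action[action].append(incident["cost"])

    report = {}
    for action, costs in by_action.items():
        report[action] = {
            "incidents": len(costs),
            "our_avg_cost": mean(costs),
            # None when no benchmark exists for this action category.
            "benchmark_avg_cost": BENCHMARK_AVG_COST.get(action),
        }
    return report


sample = [
    {"siem_category": "virus_detected", "cost": 4_000},
    {"siem_category": "virus_detected", "cost": 6_000},
    {"siem_category": "brute_force_login", "cost": 1_000},
]
report = compare_to_benchmark(sample)
for action, row in report.items():
    print(action, row)
```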

The other is that it must have context: providing count-based metrics for the whole organisation doesn’t impart any information about what line-of-business assets are involved and what the potential bottom-line impact is to the business.  “37 unauthorised access attempts across Acme Corp” says one thing; “34 on cardholder processing systems”, “2 on bank transfer systems”, “1 on the customer relationship management system”, all of which are buried deep inside your infrastructure behind several layers of controls, says quite another.  I’m going to talk more about this in a later blog posting, so I am going to park this here for a while.
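A minimal sketch of that breakdown, assuming you have an asset inventory that says which business system each host supports. The host names and the inventory dictionary are made up for the example; only the 34/2/1 split comes from the text above.

```python
from collections import Counter

# Hypothetical asset inventory: host name -> line-of-business system.
ASSET_TO_BUSINESS_SYSTEM = {
    "pay-app-01": "cardholder processing",
    "transfer-gw-01": "bank transfer",
    "crm-web-01": "customer relationship management",
}


def attempts_by_business_system(events):
    """Count events per business system; unknown hosts are flagged as unmapped."""
    counts = Counter()
    for event in events:
        system = ASSET_TO_BUSINESS_SYSTEM.get(event["host"], "unmapped")
        counts[system] += 1
    return counts


# The 37 unauthorised access attempts from the example, split 34 / 2 / 1.
events = (
    [{"host": "pay-app-01"}] * 34
    + [{"host": "transfer-gw-01"}] * 2
    + [{"host": "crm-web-01"}] * 1
)

breakdown = attempts_by_business_system(events)
for system, count in breakdown.most_common():
    print(f"{count} unauthorised access attempts on {system} systems")
```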

Another aspect of the context is granularity, and this normally requires input from the analysts and incident responders and some form of established taxonomy for the more granular categorisation of incidents.  For instance, saying you’ve blocked “34 malware infections” says one thing; saying “24 malware infections were stopped at the host level and were detected by the Intrusion Detection System”, and that out of the other 10, “5 exhibited DNS behaviour showing they attempted to connect over port 443 to external systems” and the other “5 encrypted the hard disks of systems in our payroll department just before payday”, says quite another.

It’s not just about the granularity, it’s also about the curation: helping the execs understand what the impact to the business is and giving advice on what they could do about it.  Who is the likely perpetrator, based on the tools, techniques and procedures they are using, or at least what is their capability?  When did this start?  Is it part of a campaign or a single attack?  Is it still ongoing?  How did this occur?  What vulnerabilities did the attacker exploit?  How could this be prevented from happening again?  And the most difficult, and often most important, question: why did this attacker attack us?  What were they after?

Having management information that allows information security risk to have a seat at the boardroom table with the rest of the functions that handle risk is a starting point.  Providing context to the C-level execs, enabling them to make informed risk decisions, helps move security operations from a reactive function to one that is proactive.  When this is coupled with the topics I’m going to talk about in my next couple of postings, providing line-of-business metrics and using threat intelligence, we’re moving from a Jarrow accent to a Received Pronunciation one – although there’s now’t wrong with a Yorkshire accent as my fiancée is from Hull ;)

SOC Mistake #9: You don’t tier your SOC staff

Security Information and Event Management (SIEM) platforms are all about turning the mass of raw events that occur in your organisation’s infrastructure into intelligence that can be assessed by analysts and incident responders to identify and react to information security incidents.

SIEMs, despite what the vendors will tell you, are not magic.  It will take you months to tune your ruleset to eliminate the bulk of false positives and you’re probably working against a moving target of an increasing number of event sources as well as continually having to adjust the rules to detect the new threats you’re facing.

To ensure the maximum use of your highly-skilled trained analysts, it is common to tier your analysts into at least two layers.

The initial layer is solely responsible (at least to start with) for the triage of incoming events.  That is, the identification of false positives and ensuring the appropriate prioritisation and escalation.

In an effective SOC, however, these level 1 analysts are not simply “click-monkeys”.  As well as triaging false positives they should be doing some form of initial assessment so they can evaluate the potential impact and scope of the incident.  They should also be performing some form of adversary characterisation by evaluating where in the attack chain the event was detected (detection further down the chain, such as at the command and control or lateral movement stage, may imply that the attacker has conducted significant reconnaissance and has crafted a specific exploit to be undetectable to your host or network Intrusion Detection/Protection System – this implies a motivated and fairly skilled adversary) and they should also be, from their initial investigation, ascertaining the potential impact to the business.

Often the SIEM will have some form of prioritisation algorithm based on a number of factors, but only a human analyst can take all of the context into consideration (Skill level of attacker?  Does the attacker exhibit known behaviour in their Tools, Techniques and Procedures (TTP) that can assist with attribution?  What is the apparent intent of the attacker (disruption, theft, espionage)?  Is this a one-off event or part of a sustained campaign?  Does the attack demonstrate investment of a lot of time or funds (use of zero days, for instance)?  What systems are affected and what line-of-business do they support?).

Only events that the level 1 analyst deems to be real events are escalated to the next level of more skilled analysts to conduct a deeper level of investigation.  You can create specialisations at the Level 2, or above, layers to allow workflows to be created that direct events of a certain category to specific analysts, or groups of analysts.  Some organisations have as many as three or four tiers of analysts, gradually becoming more skilled and specialised as you move up the chain.

Any false positives discovered by the analysts can be routed to content authors who can further tune the SIEM rules to try and prevent the false positive from occurring in the future.
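The routing described above can be sketched as a small decision function. This is an illustration only: the queue names, the `Event` fields and the category-to-specialist mapping are all hypothetical, and a real SIEM would implement this through its own workflow features rather than stand-alone code.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """Outcome of the Level 1 triage decision."""
    FALSE_POSITIVE = "false_positive"
    REAL = "real"


@dataclass
class Event:
    event_id: int
    category: str
    verdict: Verdict


# Hypothetical Level 2 specialist queues, keyed by event category.
LEVEL2_QUEUES = {
    "malware": "l2-malware",
    "intrusion": "l2-intrusion",
}


def route(event):
    """Return the queue a triaged event should go to next."""
    if event.verdict is Verdict.FALSE_POSITIVE:
        # False positives go to the content authors for rule tuning.
        return "content-tuning"
    # Real events go to a specialist queue, or a general Level 2 queue.
    return LEVEL2_QUEUES.get(event.category, "l2-general")


print(route(Event(1, "malware", Verdict.REAL)))
print(route(Event(2, "malware", Verdict.FALSE_POSITIVE)))
print(route(Event(3, "data_loss", Verdict.REAL)))
```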

The focus should be on making this process as efficient and repeatable as possible, while allowing the collection of metrics to support continual improvement.  For instance, in HP ArcSight, we create ActiveLists for a ‘triage channel’ and the ‘content needs tuning’.  As we’re largely automating this workflow we can collect metrics on operational Key Performance Indicators such as time-to-triage, time-to-investigate, number of false positives per use case category, number of events escalated per analyst and number of incorrectly categorised false positives per analyst.  These metrics, when combined, can help you achieve the right balance of efficiency and effectiveness.
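To make a couple of those KPIs concrete, here is a minimal sketch that computes mean time-to-triage and false positives per use case from exported triage records. It is plain Python over a made-up record layout, not an ArcSight API; the field names and sample values are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical exported triage records; field names are illustrative.
records = [
    {
        "use_case": "malware",
        "analyst": "alice",
        "received": datetime(2014, 6, 2, 9, 0),
        "triaged": datetime(2014, 6, 2, 9, 10),
        "false_positive": True,
    },
    {
        "use_case": "malware",
        "analyst": "bob",
        "received": datetime(2014, 6, 2, 9, 30),
        "triaged": datetime(2014, 6, 2, 9, 50),
        "false_positive": False,
    },
    {
        "use_case": "unauthorised_access",
        "analyst": "alice",
        "received": datetime(2014, 6, 2, 10, 0),
        "triaged": datetime(2014, 6, 2, 10, 30),
        "false_positive": True,
    },
]


def mean_time_to_triage_minutes(rows):
    """Average minutes between an event arriving and being triaged."""
    return mean((r["triaged"] - r["received"]).total_seconds() / 60 for r in rows)


def false_positives_per_use_case(rows):
    """Number of false positives per use case category."""
    return Counter(r["use_case"] for r in rows if r["false_positive"])


def escalations_per_analyst(rows):
    """Number of events each analyst escalated as real."""
    return Counter(r["analyst"] for r in rows if not r["false_positive"])


print(mean_time_to_triage_minutes(records))
print(false_positives_per_use_case(records))
print(escalations_per_analyst(records))
```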

We’ve evaluated dozens of Security Operations Centres where all of the analysts are highly trained and all operate at a single tier.  They all randomly pick the events they wish to work on off the console and do their typical ‘deep dive’ investigation.  This causes several problems:

  1. It’s hard to maintain both the broad spectrum of investigatory skills needed for triage of all event types and the deep level of specialisation needed to do a full investigation;
  2. The analyst may prefer to investigate specific categories of events, meaning that some event types may remain in the triage channel for extended periods of time;
  3. Having your highly-skilled analysts conduct the initial triage of false-positives is a bad use of their time; and
  4. Often Security Operations Centres find it really difficult to produce meaningful metrics on the overall performance of the capability, or individual analysts.

Implementing at least a two-tier system of triage/prioritisation and investigation can dramatically increase the performance of your Security Operations Centre.

 

 

5G/SOC Presentation at HP PROTECT Washington DC

I’ll be presenting session BB3055 “5G/SOC: How the world’s most advanced SOCs are leading the way” on Tuesday 5th September at 17:50 at HP PROTECT in Washington DC – talk about a graveyard shift!

“If we’ve learned anything from all the media attention given to data breaches in the past few years, it’s that no matter who you are, someone out there wants to steal your critical data. The type of data varies, but everyone has something worth stealing. Today’s mature SOC teams are incorporating new technologies, sharing information, and expanding their focus outside of the enterprise to include the modeling of attacker activities and personas. We are now entering the fifth generation of security operations, or what we like to call the 5G SOC. Hear more about the 5G SOCs of today, which monitor more than ever before, and how they change the focus from simply monitoring systems to monitoring the actors perpetrating the attacks. Benefit from 5G SOCs looking beyond their enterprises’ borders and tracking activities in social media, changes in global politics, and shifts in attacker economics in order to discover threats and act on them.”

SOC Mistake #10: You confuse your SOC with your NOC

Network Operations Centres (NOCs) are responsible for the operational monitoring of infrastructure and services. Their function is to identify, investigate, prioritise and escalate/resolve issues that could, or do, affect performance or availability. A Security Operations Centre (SOC) shares much in common with a NOC: its function is to identify, investigate, prioritise and escalate/resolve issues that could, or do, affect the security of an organisation’s information assets.

It is no surprise then that I am frequently asked by customers looking to build a SOC “Why can’t we use our NOC for this function?”. I can understand the motivation behind this question: once you’ve stood up your Security Information & Event Management (SIEM) platform, identified your use cases, got the right event sources feeding events into the SIEM and then got your SOC procedures nailed, the largest cost of running a SOC is typically headcount.

There are, however, a few reasons why a combined SOC and NOC isn’t always a good idea:

1. They serve different, often conflicting, masters.

Within organisations there is often a conflict between operations and information security teams – information security want to pull the plug on a compromised server that happens to be hosting a critical service; they want vulnerabilities patched as soon as patches are available, often without fully testing the impact on operations; they can’t understand why dealing with an incident isn’t always the top priority for the operations team. Likewise, operations often stand up new pieces of infrastructure without notifying the security team or going through change control; they may not fully harden platforms prior to deployment in order to “meet a tight deadline” (“we’ll come back and patch it later”); they may not apply critical patches through lack of a testing environment.

The NOC is measured and compensated for its ability to meet Service Level Agreements (SLAs) for network and application availability, Mean Time Between Failures and application response time. In contrast, SOCs are measured on how well they protect against malware; their protection of intellectual property and customer data; and ensuring that corporate information assets aren’t misused. The business driver behind both of these is to manage business risks – in a NOC, for instance, the loss of revenue or compensation for breach of an SLA; in a SOC, regulatory fines or loss of customer confidence.

NOCs are about availability and performance, SOCs are about security. Even with the best intentions, having the team responsible for availability and performance make decisions about incident response and the application of controls that will, invariably, impact on the availability and performance of services (even if it is just through the diversion of human resources), is never going to work well.

NOCs and SOCs certainly should be in close co-ordination. One of the best ways of achieving this is to ensure the NOC has a view of the SIEM platform. I’ve seen SOCs react to “large scale Distributed Denial of Service attacks” that have been the result of legitimate traffic after the launch of a new service, and I’ve seen subtle patterns detected by alert NOC analysts result in uncovering wide-scale penetrations within organisations. When it comes to actually responding to a confirmed incident, operations and information security must work hand-in-hand to investigate, contain, eradicate and recover from the attack with appropriate and proportionate responses. Working together in a collaborative manner as a part of an incident response team, a SOC and NOC help ensure the right balance.

A well-implemented collaboration strategy between a NOC and SOC should identify that the SOC’s function is to analyse security issues and to recommend fixes and then the NOC analyses the impacts of those fixes on the business, makes recommendations on whether to apply the fix, makes the appropriate approved changes and then documents those changes.

2. The skills needed in, and the responses required from, a NOC analyst and SOC analyst are vastly different

NOC analysts require a proficiency in network, systems and application engineering, whereas SOC analysts require skills in security engineering. The tools and processes used for monitoring and investigating events also differ, as does the interpretation of the data they produce: a NOC analyst may interpret a device outage as an indicator of hardware failure, while a SOC analyst may interpret that same event as evidence of a compromised device. Likewise, using the example I gave above, high bandwidth utilisation will cause the NOC to take steps to ensure availability; in contrast, the SOC may first question the cause of the traffic spike, the reputation of its origin and correlations against other known attacks.

One of the biggest differences between a SOC and a NOC is that a SOC is looking for “intelligent adversaries” as opposed to naturally occurring system events such as network outages, system crashes and disk failures. While these naturally occurring system events can, in fact, be caused by the actions of “intelligent adversaries”, the NOC’s concern is the restoration of the quality of service as soon as possible – even if this involves the destruction of evidence that would allow the investigation of the cause.

3. Staff attrition is waaaaaay worse in a SOC

Level 1 SOC Analysts, those responsible for the triage of incoming events, burn out with often alarming regularity. The average tenure of a Level 1 SOC Analyst is typically less than two years and turnover can be as high as 20% per annum. In contrast, the tenure and turnover of NOC staff is typically much better.

This attrition within a SOC needs to be planned for with a suitable feeder pool of new candidates and an effective on-boarding training scheme to teach them about the use of the SIEM platform, the analytical skills needed to investigate incidents and internal procedures. Developing a career progression plan for your analysts will also allow you to retain these valuable resources within your business, potentially moving them to security engineering or incident response positions.

Despite everything I’ve said above it is possible to run an effective combined SOC/NOC, but it can take more effort, operational expense and better governance than running them as separate functions. The potential benefits lie in the introduction of a single point-of-contact for all security and operational issues, as well as the tight integration between those who discover and react to information security incidents, and those who have to deploy and manage the mitigations post event. Whether you choose to keep the functions separate or integrate them, it is important to understand the differences between the functions.

University of Washington Introduction to Data Science on Coursera

I’m currently working through the excellent Introduction to Data Science course from the University of Washington on Coursera.

I would say that the course requires a reasonable knowledge of Python programming in order to handle most of the assignments, but there are plenty of resources out there to help you with this too.  Overall the video lectures are well delivered with excellent references given for additional reading.

The main topics covered are:

  • Introduction to Data Science;
  • Relational Databases, Relational Algebra;
  • MapReduce;
  • NoSQL;
  • Statistics;
  • Machine Learning;
  • Visualisation; and
  • Graph Analytics

The course runs over 8 weeks with around 12 hours a week needed to complete the lectures and assignments.   The University of Washington haven’t announced the dates for the next intake, but in the past it’s been around May.

B-Sides Manchester 2014: What #SOCFail looks like, and how to avoid it: AKA sort your “little” data out before going BIG

I’m presenting at B-Sides Manchester in Track 2 at 13:00 next week on 27 June 2014.

The presentation takes a humorous look at what #SOCFail looks like, focusing on the top 10 mistakes made by organisations building Security Operations Centres.  In addition, we’ll discuss how to avoid these pitfalls, discuss what a good SOC looks like and list some emerging trends in event detection, investigation and response.

State of Security Operations Report

Today our practice has published the first State of Security Operations Report, which looks at the trends, best practices and key capabilities we’ve observed in over

Security Operations Maturity Assessments we’ve conducted over the past few years.

The Security Operations Maturity Assessment looks at over 160 different aspects of business alignment, governance, people, process and technology involved in running an effective and efficient Security Operations Centre.  The report highlights both the positive trends and common mistakes we’re seeing across all vertical markets and territories.

We’re hoping that the report will become an annual release.

 

 

 
