Wednesday, January 28, 2015

Making DevOps Business Driven - a service view

I've been doing a bit recently around DevOps, and what I've been seeing is that companies that have been scaling DevOps tend to run into a problem: exactly what is a good boundary for a DevOps team? Now I've talked before about how Microservices are just SOA with a new logo, and there is an interesting piece about DevOps as well: it's not actually a brand new thing.  It's an evolution and industrialisation of what was leading practice several years ago.

Back in 2007 I gave a presentation on why SOA was a business challenge (full deck at the end) and in there were two pictures that talked about how you needed to change the way you thought about services:

So on the left we've got a view that says you need to think about a full lifecycle, and on the right we've got a picture that talks about the need to have an architect, an owner and a delivery manager (programme manager).
This is the structure we were using on SOA projects back in 2007, getting the architects and developers (but ESPECIALLY the architects) to be accountable for the full lifecycle.  It's absolutely fantastic to see this becoming normal practice, and there are some great lessons and technical approaches out there.

One thing I've not seen, however, is an answer to what a DevOps team's boundary is and how to manage a large number of DevOps teams.  This is where Business Architecture comes in.  The point here is that it's not enough to just have lots and lots of DevOps teams; you need to align those teams to the business owners and to the structure that is driving them.  You also need that structure so one team doesn't just call the 'Buy from Ferrari' internal service without going through procurement first for approval.

So in a DevOps world we are beginning to realize the full-lifecycle view of Business Services, providing a technical approach to automating and managing services that look like the business, evolve like the business and give the business a structure where they can focus costs where they deliver the most value.

There is much that is new in the DevOps world, but there is also much we can learn from the Business Architecture space about how to set up DevOps teams to better align to the business and enable DevOps to scale at traditional, complex organisations as well as at (from a business model perspective) simpler internet companies.

Tuesday, January 20, 2015

Big Data and the importance of Meta-Data

Data isn't really respected in businesses.  You can see that because, unlike other corporate assets, there is rarely a decent corporate catalog that shows what exists and who has it.  In the vast majority of companies there is more effort and automation put into tracking laptops than into cataloging and curating information.

Historically we've sort of been able to get away with this because information has resided in disparate systems, and even the systems that join it together, an EDW for instance, have had only a limited number of sources and have viewed the information in only a single way (the final schema).  So basically we've relied on local knowledge of the information to get by.  That really doesn't work in a Big Data world.

The whole point in a Big Data world is having access to everything, being able to combine information from multiple places within a single Business Data Lake so you can allow the business to create their own views.

Quite simply, without Meta-Data you are not giving them any sort of map to find the information they need, or to help them understand the security required.  Meta-Data needs to be a day-one consideration on a Big Data program; by the time you've got a few dozen sources imported, it's going to be a pain to go back and add the information.  This also means the tool used to search the Meta-Data is going to be important.

In a Big Data world Meta-Data is crucial to making the Data Lake business friendly, and essential to ensuring the data can be secured.  Let's be clear here: HCatalog does matter, but it's not sufficient.  You can do a lot with HCatalog, but that is only the start, because you've got to look at where information comes from, what its security policy is, and where you've distilled that information to.  So it's not just about what is in the HDFS repository; it's about what you've distilled into SQL or Data Science views, and about how the business can access that information, not just "you can find it here in HDFS".

This is what Gartner were talking about in the Data Lake Fallacy, but as I've written elsewhere, that critique rather missed the point: HDFS isn't the only part of a data lake, and EDW approaches solve only one set of problems, not the broader challenge of Big Data.

Meta-Data tools are out there, and you've probably not really looked at them, so here is what you need to test (not a complete list, but these for me are the must-have requirements):
  1. Lineage from source - can it automatically link to the loading processes to say where information came from?
  2. Search - can I search to find the information I want?  Can a non-technical user search?
  3. Multiple destinations - can it support HDFS, SQL and analytical destinations?
  4. Lineage to destination - can it link to the distillation process and automatically provide lineage to destination?
  5. Business View - can I model the business context of the information (Business Service Architecture style)?
  6. My own attributes - can I extend the Meta-Data model with my own views on what is required?
The point about modelling in a business context is really important.  Knowing information came from an SAP system is technically interesting, but knowing it's Procurement data that is blessed and created by the procurement department (as opposed to being a secondary source) is significantly more valuable.  If you can't present the meta-data in a business structure, business users won't be able to use it; it's just another IT-centric tool.
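As a sketch of what such a business-facing catalog entry might hold, covering the six requirements above (the class and field names here are my own invention, not taken from any particular meta-data tool):

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One entry in a hypothetical business-facing meta-data catalog."""
    name: str                   # e.g. "Supplier Spend"
    business_service: str       # business context, e.g. "Procurement" (req 5)
    source_systems: list[str]   # lineage from source (req 1)
    destinations: list[str]     # lineage to SQL/analytical views (reqs 3 & 4)
    security_policy: str        # who may see it and under what policy
    custom: dict[str, str] = field(default_factory=dict)  # own attributes (req 6)

    def matches(self, term: str) -> bool:
        """Naive search (req 2): case-insensitive match on name or business service."""
        t = term.lower()
        return t in self.name.lower() or t in self.business_service.lower()

# A Procurement-blessed data set loaded from SAP and distilled to a SQL view:
entry = CatalogEntry(
    name="Supplier Spend",
    business_service="Procurement",
    source_systems=["SAP ECC"],
    destinations=["HDFS:/lake/raw/spend", "SQL:finance.spend_view"],
    security_policy="internal-finance-only",
)
print(entry.matches("procurement"))  # True
```

The key design point is that a business user searching for "procurement" finds the data by its business context, not by knowing its HDFS path.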

The advantage of Business Service structured meta-data is that it matches up to how you evolve and manage your transactional systems as well.

Thursday, January 15, 2015

Security Big Data - Part 7 - a summary

Over six parts I've gone through a bit of a journey on what Big Data Security is all about.
  1. Securing Big Data is about layers
  2. Use the power of Big Data to secure Big Data
  3. How maths and machine learning helps
  4. Why it's how you alert that matters
  5. Why Information Security is part of Information Governance
  6. Classifying Risk and the importance of Meta-Data
The fundamental point here is that encryption and ACLs provide only a basic hygiene factor when it comes to securing Big Data.  The risk and value of information is increasing, and by creating Big Data solutions businesses are creating more valuable, and therefore more at-risk, information solutions.  This means that Information Security needs to become a fundamental part of Information Governance, and that new ways of securing that information are required.

This is where Big Data comes to its own rescue: large data sets enable new generations of algorithms to identify risk and then alert based on that risk and the right way to handle it.  This all requires you to treat Information Security as a core part of the Meta-Data that is captured and governed around information.

The time to start thinking, planning and acting on Information Security is now.  It's not when you become the next Target, or when one of your employees becomes your own personal Edward Snowden; it's now, and it's about having a business practice and approach that considers information a valuable asset and secures it in the same way as other assets in the business are secured.

Big Data Security is a new generation of challenges and a new generation of risks; these require a new generation of solutions and a new corporate culture, one where information security isn't just left to a few people in the IT department.

Tuesday, January 13, 2015

Securing Big Data Part 6 - Classifying risk

So now that your Information Governance group considers Information Security to be important, you have to think about how it should classify the risk.  There are documents out there that talk about frameworks for this.  British Columbia's government has one, for instance, that talks about High, Medium and Low risk, but for me that really misses the point and oversimplifies the problem, which ends up complicating implementation and operational decisions.

In a Big Data world it's not simply about the risk of an individual piece of information; it's about the risk in context.  So the first stage of classification is "what is the risk of this information on its own?", and it's that sort of classification that the BC Government framework helps you with.  There are some pieces of information (the Australian Tax File Number, for instance) whose corporate risk is high just as an individual piece of information.  The Australian TFN has special handling rules and significant fines if handled incorrectly, which puts it well beyond "Personal Identification Information", the level many companies consider the highest.  So at this level I'd recommend having five risk statuses:

  1. Special Risk - Specific legislation and fines apply to this piece of information
  2. High - losing this information has corporate reputation and financial risk
  3. Medium - losing this information can impact corporate competitiveness
  4. Low - losing this information has no corporate risk
  5. Public - the information is already public
The point here is that this is about information as a single entity: a personal address, a business registration, etc.  That is only the first stage when considering risk.
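A sketch of those five statuses as code (my own naming, ordered so that a higher value means more risk, which makes the later aggregation comparisons trivial):

```python
from enum import IntEnum

class Risk(IntEnum):
    """The five single-item risk statuses, ordered least to most severe."""
    PUBLIC = 0   # the information is already public
    LOW = 1      # losing it carries no corporate risk
    MEDIUM = 2   # losing it can impact corporate competitiveness
    HIGH = 3     # losing it carries corporate reputation and financial risk
    SPECIAL = 4  # specific legislation and fines apply (e.g. the Australian TFN)

# Ordering lets you reason about "worst of" directly:
print(max(Risk.MEDIUM, Risk.SPECIAL).name)  # SPECIAL
```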

The next stage is considering the Direct Aggregation Risk: what happens when you combine two pieces of information, does that change the risk?  The categories remain the same, but here we are looking at other elements.  So for instance address information on its own would be low risk or public, but when combined with a person that link becomes higher risk.  Corporate information on sales might be medium risk, but when it is tied to specific companies or revenue it could become a bigger risk.  Also at this stage you need to look at the policy for allowing information to be combined, and you don't want an "always no" policy.

So what if someone wants to combine personal information with Twitter information to get personal preferences?  Is that allowed?  What is the policy for getting approval for new aggregations, how quickly is risk assessed, and is business work allowed to continue while the risk is assessed?  When looking at Direct Aggregation you are often looking at where the new value in Big Data will come from, so you cannot just prevent that value being created.  Instead, set up clear boundaries for where approval is required in advance (combining PII with new sources, for instance) and where you can get approval after the fact (sales data with anything is OK; we'll approve at the next quarterly meeting or modify the policy).

The final stage is the most complex: the Indirect Aggregation Risk.  This is the risk that arises when two sets of aggregated results are combined and, though independently they are not high risk, pulling that information together constitutes a higher level of risk.  The answer here is actually to simplify the problem and treat aggregations not just as aggregations but as information sources in their own right.
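One way to sketch that simplification in code (the rule set here is entirely illustrative, not a standard): an aggregation's risk is at least the worst of its sources' risks, escalated further when a known-risky combination is present, and the result is then just another source with a risk of its own.

```python
# Illustrative only: risk levels ordered least to most severe.
LEVELS = ["Public", "Low", "Medium", "High", "Special"]

# Hypothetical escalation rules in the spirit of the post: an address alone
# may be public, but tied to a customer the link becomes higher risk.
ESCALATIONS = {
    frozenset({"Customer", "Address"}): "High",
    frozenset({"Customer", "Twitter"}): "Medium",
}

def aggregation_risk(sources: dict[str, str]) -> str:
    """Risk of an aggregation: at least the max of its sources' risks,
    raised further if a known-risky combination of sources is present."""
    level = max(LEVELS.index(risk) for risk in sources.values())
    for combo, raised in ESCALATIONS.items():
        if combo <= sources.keys():
            level = max(level, LEVELS.index(raised))
    return LEVELS[level]

# An aggregation then becomes a source in its own right with this risk:
print(aggregation_risk({"Customer": "Medium", "Address": "Public"}))      # High
print(aggregation_risk({"Organization": "Low", "Address": "Public"}))     # Low
```

Indirect aggregation then needs no special machinery: feed one aggregation's name and computed risk back in as a source to the next.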

This brings us to the final challenge in all this classification: Where do you record the risk?

Well, this is just meta-data, but that is often the area companies spend the least time thinking about.  When you are looking at massive amounts of data, and particularly at disparate data sources and their results, Meta-Data becomes key to Big Data.  But let's look just at the security side for the moment.

Data Type          Scope        Direct Risk
Customer           Collection   Medium
Tax File Number    Field        Special
Twitter Feed       Collection   Public

and for Aggregations

Source 1      Source 2   Source 3   Source 4   Aggregation Name           Aggregation Risk
Customer      Address    Invoice    Payments   Outstanding Consumer Debt  High
Customer      Twitter    Location   -          Customer Locations         Medium
Organization  Address    Invoice    Payments   Outstanding Company Debt   Low

The point here is that you really need to start thinking about how you automate this and what tools you need.  In a Big Data world the heart of security is being able to classify the risk and having that classification inform the Big Data anomaly detection, so you can alert the right people and manage the risk.

This gives us the next piece of classification that is required: understanding who gets informed when there is an information breach.  This is a core part of the Information Governance and classification approach, because it's here that the business needs to say "I'm interested when that specific risk is triggered".  This is another piece of Meta-Data, and one that then tells the Big Data security algorithms who should be alerted.
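That "who gets informed" classification is itself just more Meta-Data.  A minimal sketch, with the role names and the routing table entirely invented for illustration:

```python
# Illustrative: which roles are alerted when a breach touches data at a given
# risk level.  In practice this mapping lives in the meta-data catalog and is
# set by the business as part of Information Governance, not hard-coded.
ALERT_ROUTING = {
    "Special": ["Chief Risk Officer", "Data Owner"],
    "High":    ["Chief Information Security Officer", "Data Owner"],
    "Medium":  ["Line-of-Business Owner"],
    "Low":     ["IT Operations"],
    "Public":  [],
}

def who_to_inform(risk: str) -> list[str]:
    """Look up the interested parties; default to the CISO for unknown levels."""
    return ALERT_ROUTING.get(risk, ["Chief Information Security Officer"])

print(who_to_inform("High"))  # ['Chief Information Security Officer', 'Data Owner']
```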

If classification isn't part of your Information Governance group, or indeed you don't even have a business-centric IG group, then you really don't consider either information or its security to be important.

Other Parts in the series
  1. Securing Big Data is about layers
  2. Use the power of Big Data to secure Big Data
  3. How maths and machine learning helps
  4. Why it's how you alert that matters
  5. Why Information Security is part of Information Governance

Monday, January 12, 2015

Securing Big Data Part 5 - your Big Data Security team

What does your security team look like today?

Or the IT equivalent: "the folks that say no".  The point is that in most companies information security isn't actually considered important.  How do I know this?  Well, because basically most IT Security teams are the equivalent of nightclub bouncers: they aren't the people who own the club, they aren't as important as the barman, certainly not as important as the DJ, and in terms of nightclub strategy their only input will be on the ropes being set up outside the club.

If information is actually important, then information security is much more than a bunch of bouncers trying to keep undesirables out.  It's about the practice of information security and the education around information security; in this view, information security is actually a core part of Information Governance, and Information Governance is very much a business-led thing.

Big Data increases the risks of information loss because fundamentally you are not only storing more information, you are centralizing more information, which means more inferences can be made, more links made and more data stolen.  This means that historical thefts, which stole data from a small number of systems, risk being dwarfed by Big Data hacks which steal huge data sets, or even run algorithms within a data lake and steal the results.

So when looking at Big Data security you need to split governance into three core groups.
The point here is that this governance is exactly the same as your normal data governance; it's essential that Information Security becomes a foundation element of information governance.  The three different parts of governance are set up because their focuses differ:

  1. Standards - sets the gold standard of what should be achieved
  2. Policy - sets what can be achieved right now (which may not meet the gold standard)
  3. KPI Management - tracks compliance to the gold standard and adherence to policy
The reason these are not just a single group is that the motivations are different.  Standards groups set out what would be ideal, and it's against this ideal that progress can be tracked.  If you combine Standards groups with Policy groups you end up with standards that are "the best we can do right now", which doesn't give you something to track towards over multiple years.

KPI management is there to keep people honest.  This is the same sort of model I talked about around SOA Governance, and it's the same sort of model that whole countries use, so it tends to surprise me when people don't understand the importance of standards versus policy, and the importance of tracking and judging compliance independently from those executing.

So your Big Data Security team starts and ends with the Information Governance team.  If information security isn't a key focus for that team, then you aren't considering information as important and you aren't worried about information security.

Other Parts in the series
  1. Securing Big Data is about layers
  2. Use the power of Big Data to secure Big Data
  3. How maths and machine learning helps
  4. Why it's how you alert that matters

Friday, January 09, 2015

Securing Big Data - Part 4 - Not crying Wolf.

In the first three parts of this series I talked about how securing Big Data is about layers, then about how you need to use the power of Big Data to secure Big Data, then about how maths and machine learning help to identify what is reasonable and what is anomalous.

The Target credit card hack highlights this problem.  Alerts were raised, lights did flash.  The problem was that so many lights flashed and so many alarms normally went off that people couldn't separate the important from the noise.  This is where many complex analytics approaches have historically failed: they've not shown people what to do.

If you want a great example of IT's normal approach to this problem then the ethernet port is a good example.
What does the colour yellow normally mean?  It's a warning colour, so something that flashes yellow would be bad, right?  Nope, it just means that a packet has been detected... err, but doesn't the green light already mean that it's connected?  Well yes, but that isn't the point: if you are looking at a specific problem then the yellow NOT flashing is really an issue... so yellow flashing is good, yellow NOT flashing is bad...

Doesn't really make sense, does it?  It's not a natural way to alert.  There are good technical reasons to do it that way (it's easier technically), but that doesn't actually help people.

With security this problem is amplified, and it is often made worse by centralising reactions in a security team which knows security but doesn't know the business context.  The challenge therefore is to categorize the type of issue and have different mechanisms for each one.  Broadly these risks split into four groups.
It's important when looking at risks around Big Data to understand which group a risk falls into, which then indicates the right way to alert.  It's also important to recognize that, as information becomes available, an incident may escalate between groups.

So let's take an example.  A router indicates that it's receiving strange external traffic.  This is an IT operations problem, and it needs to be handled by the group in IT ops which deals with router traffic.  Then the Big Data security detection algorithms link that router issue to the access of sales information from the CRM system.  This escalates the problem to the LoB level; it's now a business challenge, and the question becomes a business decision on how to cut or limit access.  The Sales Director may choose to cut all access to the CRM system rather than risk losing the information, or may consider it a minor business risk when lined up against closing the current quarter.  The point is that the information is presented in a business context, highlighting the information at risk so a business decision can be taken.

Now let's suppose that the Big Data algorithms link the router traffic to a broader set of attacks on the internal network, a snooping hack.  This is where the Chief Information Security Officer comes in: that person needs to decide how to handle this broad-ranging IT attack.  Do they shut down the routers and cut the company off from the world?  Do they start dropping and patching?  Do they alert law enforcement?

Finally, the Big Data algorithms find that credit card data is at risk.  Suddenly this becomes a corporate reputation risk issue, and it needs to go to the Chief Risk Officer (or the CFO if they have that role) to take the pretty dramatic decisions that need to be made when a major cyber attack is underway.

The point here, though, is that how issues are highlighted and escalated needs to be systematic; it can't all go through a central team.  The CRO needs to be automatically informed when the risk is sufficient, but only then.  If it's a significant IT risk then it's the job of the CISO to inform the CRO, not for every single risk to be highlighted to the CRO as if they need to deal with them all.

The basic rule is simple: "Does the person seeing this alert care about this issue?  Does the person seeing this alert have the authority to do something about this issue?  And finally: does the person seeing this alert have someone lower in their reporting chain who answers yes to those questions?"

If the answers are "Yes, Yes, No" then you've found the right level, and you then need to concentrate on the mechanism.  If they are "Yes, Yes, Yes" then you are in fact cluttering that person's view by showing them everything that every person in their reporting tree handles as part of their job.
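A sketch of that rule as code, with the escalation chain and the authority model invented purely for illustration: walk the reporting chain from most junior upwards and alert the first person who answers yes to both questions.

```python
# Illustrative: find the right alert level using the "Yes, Yes, No" rule.
# `chain` is ordered from most junior to most senior.
def alert_target(chain, cares, has_authority):
    """Return the most junior person for whom both answers are 'yes';
    everyone above them answers 'Yes, Yes, Yes' and should not be paged."""
    for person in chain:
        if cares(person) and has_authority(person):
            return person
    return None  # no one fits: the alert is mis-scoped

# Hypothetical escalation chain for a router incident tied to CRM sales data:
chain = ["network engineer", "IT ops manager", "Sales Director", "CISO", "CRO"]
authority = {"network engineer": False, "IT ops manager": False,
             "Sales Director": True, "CISO": True, "CRO": True}

print(alert_target(chain,
                   cares=lambda p: True,
                   has_authority=lambda p: authority[p]))  # Sales Director
```

In this toy run everyone cares, but authority to act on the CRM data starts at the Sales Director, so the alert stops there; the CISO and CRO hear about it only if the incident escalates.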

In terms of the mechanism, it's important to think about that "flashing yellow light" on the Ethernet port.  If something is OK then "green is good"; if it's an administrative issue (the patch level on a router) then it needs to be flagged into the tasks to be done; if it's an active and live issue it needs to come front and center.

In terms of your effort when securing Big Data, you should be putting more effort into how you react than into almost any other stage in the chain.  If you get the last part wrong then you lose all the value of the former stages.  This means you need to look at how people work and what mechanisms they use.  Should the CRO be alerted via a website they have to go to, or via an SMS to the mobile they carry around all the time, one that takes them to a mobile application on that same device?  (Hint: it's not the former.)

This is the area where I see the least effort made and the most mistakes being made, mistakes that normally amount to "crying wolf": showing every single thing and expecting people to filter out thousands of minor issues and magically find the things that matter.

Target showed that this doesn't work.