Mar 26 ’13

Sensitive Data: Knowing What to Protect and How Best to Protect It

by Barry Schrager in Enterprise Tech Journal

Securing all your sensitive data is an overwhelming concept for many data center managers, security officers and privacy officers. To a great extent, they would prefer to ignore it. Decades of migrated data have been swept under the rug; this mountain of data sets represents the dirty little secret nobody wants to address. The fact is, most security staffs can’t tell you where all the sensitive data resides, which means they just can’t protect data they can’t locate and categorize.

The Legacy Data Challenge

The designers of the original IBM System/360 series hardware and software should be congratulated for their foresight. Mainframes have been around a long time, all the while maintaining upward compatibility. Unfortunately, that also means data has been kept around, too, and some of it contains sensitive information. Even the best data security teams can’t protect what they don’t know about.

An easy response is that, because of the mainframe’s isolation, only insiders even have the potential to access this data. However, a 2012 InformationWeek/Dark Reading Strategic Security Survey asked, “Which of these possible sources of breaches or espionage pose the greatest threat to your company in 2012?” Respondents could select three categories; the responses were:

• 52 percent: Authorized users or employees
• 52 percent: Cybercriminals
• 44 percent: Application vulnerabilities
• 24 percent: Public interest groups/hacktivists
• 21 percent: Service providers, consultants and auditors.

Note that insiders (authorized users or employees) are perceived to be a threat equal to cybercriminals.

Internal Threats

These insiders may not have started out as criminals, but they can become vulnerable and eventually be compromised by organized crime. This subversion need not even be overt. A subtle look the other way is often enough to enable a massive breach. For example, in December 2011, the Manhattan District Attorney indicted 55 individuals for their participation in an organized identity theft and financial crime ring. This cybercrime breach involved cooperation from corrupt employees at banks, a non-profit institution, a high-end car dealership and a real estate management company. The defendants acquired and sold the names, dates of birth, addresses, Social Security numbers and financial account information of unsuspecting victims. (For further details, see The New York Times report at www.nytimes.com/2011/12/16/nyregion/indictment-expected-against-55-in-cybercrime.html?_r=1.)

Various studies over the years have identified insiders as a top cause of data breaches. They have access to production data as needed to perform their usual activities, and to much non-production data at varying levels of access. Even savvy organizations fail to consider that insiders also have access to old development test data sets, which were necessary at the time but were never deleted. There’s also data accumulated from mergers and acquisitions that resides unprotected or underprotected and is available to many users.

The three z/OS external security managers—IBM’s RACF and CA’s ACF2 and Top Secret—excel at protecting assets, secrets and personal items that organizations are required to protect, but only if the data security staff knows to do so. Unknown sensitive data sets aren’t properly protected simply because they’re not in the production data and the people responsible don’t know what sensitive data is in them.

Historical practices of the development and Quality Assurance (QA) teams as well as normal system users have caused an extensive accumulation of data over the years. When creating new application programs or modifying existing ones, programmers commonly used copies of actual production data to test against. It was too dangerous to directly use the actual production data and difficult to generate accurate data to simulate the real thing. QA professionals have typically taken the same short cut.

Today, there are tools and automated systems for de-identification of data to ensure the development and test organizations never use actual production data containing sensitive information. But these policies are relatively new and can be easily circumvented when teams are dealing with a massive effort or are under a deadline to get an application completed and tested on time.

Footprint Left Behind

Consider, too, the footprint these practices have left behind. There’s that one data set squirreled away because it had the best test data and everybody used it. It’s been copied and shared repeatedly through the years. No one person could possibly locate and identify all the copies.

The problem isn’t exclusive to developers. Other employees, as part of their daily responsibilities, perform database queries or generate reports and store the results in their own personal, non-production, data sets.

These are just a few sources of the unknown unknowns. The mainframe has been an upwardly compatible platform for 40 years, so there are up to 40 years of accumulated data. This data may reside on primary or migrated storage and often on tape. How many of these data sets contain sensitive information and how can the data security team properly protect them if they don’t know they exist? What about the data that’s been assimilated by mergers and acquisitions and generated by a whole different staff?

As an example, at one installation, data assimilated as a result of a bank acquisition several years ago was searched to determine if access permissions were appropriate. The result of this project showed that more than 2 million access permissions were inappropriate.

There are two primary areas of concern:

• Production data sets and database tables where the contents contain a different category of sensitive data than the security control definitions assume
• Other data sets and database tables, not considered part of the production environment, for which the data security administrators lack appropriate categorization information and which are therefore improperly protected.

This isn’t just a matter of maintaining best practices and minimizing risk. There are many laws governing disclosure of certain kinds of information, which vary from country to country. In the U.S., they include The Health Insurance Portability and Accountability Act (HIPAA), The Health Information Technology for Economic and Clinical Health Act (HITECH), and the Sarbanes-Oxley Act (SOX). Data breaches, which can cause violations of these laws, often have severe repercussions, yet the HIPAA, HITECH, and SOX laws offer little to no guidance on methods for identifying and protecting sensitive data. The European Union has a set of new data privacy rules that are expected to become law by the end of 2013 and carry penalties of up to 2 percent of a company’s annual turnover for a breach.

The PCI Standard

Fortunately, the Payment Card Industry (PCI) has a well-defined standard in the PCI Data Security Standard (PCI DSS), currently at version 2.0. While not law, PCI DSS has significant penalties for failure to comply. For example, a company can be considered out-of-compliance if it doesn’t meet the standard during a regular audit; no data breach is required. This standard is quickly becoming a best practice, one that most believe will eventually be the basis for protecting other types of sensitive data such as intellectual property or new product plans.

The first step of a PCI DSS assessment states: “The assessed entity identifies and documents the existence of all cardholder data in their environment” (see https://www.pcisecuritystandards.org/documents/pci_dss_v2.pdf).

This means the first step in achieving PCI compliance, and a best practice for any other kind of compliance, is to determine the locations of all the sensitive data. This is reinforced by two statements in the 2008 Verizon Data Breach Report (see www.verizonbusiness.com/resources/security/databreachreport.pdf):

• Sixty-six percent of breaches involved data the victim didn’t know was on the system.
• Knowing what information is present within the organization, its purpose within the business model, where it flows and where it resides is foundational to its protection.
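
To make that first step concrete, a discovery scan can be as simple in principle as looking for card-number patterns and confirming them with a Luhn checksum. The following sketch, in Python, scans a flat-file export of a data set; the file name and record layout are assumptions for illustration, and real discovery tools handle far more (EBCDIC records, packed-decimal fields, database unloads and so on).

```python
# Minimal discovery sketch: scan exported text records for candidate
# Primary Account Numbers (PANs) using a digit pattern plus a Luhn check.
# The input file name and record layout are assumptions for illustration.
import re

PAN_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:   # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan_file(path: str):
    """Yield (line_number, digits) pairs for likely PANs in a text export."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            for match in PAN_PATTERN.finditer(line):
                digits = re.sub(r"[ -]", "", match.group())
                if 13 <= len(digits) <= 16 and luhn_valid(digits):
                    yield lineno, digits

if __name__ == "__main__":
    for lineno, pan in scan_file("unloaded_dataset.txt"):  # hypothetical export
        print(f"line {lineno}: possible PAN ending in {pan[-4:]}")
```

Even a crude scan like this surfaces candidates that can then be confirmed or dismissed as false positives, which is exactly the inventory the PCI first step requires.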

False Sense of Security

An often-heard response is “My organization is safe. We’ve taken all the right steps to ensure our production environment is protected!” It’s nice and comforting to know that an organization believes all sensitive data in the production data sets and database tables is properly protected. However, this assumes there’s no sensitive data in non-production data sets, and that no production data sets or database tables contain a different category of sensitive data than expected; data in either situation is categorized incorrectly and therefore isn’t properly protected.

For example, credit card numbers have been found in notes fields of a customer call log database, where they should never have been entered. In cases like this, either the data sets or database tables must be cleaned up, or the access permissions must be revised to fit the newly discovered sensitivity of the data.

A common response from some sites is that they’ve outsourced their sensitive data. Outsourcing data doesn’t relieve the organization of liability and many users still have general access to the data even though it resides on the outsourcer’s equipment. The same requirements and responsibilities to protect the data still apply, and the complexity is often increased. That’s because now the outsourcer’s employees and contractors may also have access to this data as part of their routine support and maintenance activities.

Some Remedies

Virtually all installations have sensitive data. The first step is identifying and locating that data. Once the data sets and database tables containing this sensitive information have been identified, there are several different remedies:

• If you don’t need it, get rid of it! Review your organization’s data retention policy, which is usually tied to the creation date or the date when the data was last accessed or referenced. Fortunately, z/OS maintains both the creation date and last referenced date for all data sets and an organization can make decisions based on this information. Be sure the actual process used to determine which data sets contain sensitive information doesn’t reset the last referenced dates or, alternatively, obtain and record this information before reviewing them. (If the dates are reset, you will have lost this valuable remediation information.) Then create a list of data sets that haven’t been referenced since before the data retention policy timeframe. These are the data sets that can be either deleted or encrypted and archived to tape. When deleting all these data sets, the z/OS feature, erase-on-scratch, should be used to assure there’s no residual sensitive data remaining on the storage devices that someone could accidentally access.
• Determine when data sets were last accessed. Learn which of the remaining data sets potentially containing sensitive information haven’t been accessed in the recent past, say six months or a year. Many organizations like to use 13 to 15 months to include information referenced only once per year. Then remove all access privileges, encrypt the data sets and migrate them. These data sets shouldn’t be deleted since there’s no assurance that a user, manager or auditor, for a legitimate business purpose, won’t need to access them again. You can bet someone will eventually try to legitimately gain access to several of these data sets. By keeping them in the catalog, as is the case with migrated data sets, they’re still “accessible.” While this will inconvenience a few users, the data will be protected until it can be more closely evaluated and the access privileges adjusted accordingly. (A minimal sketch of this date-based triage, combined with the retention check above, appears after this list.)
• Address data sets that are in “current” use. View the data set and validate that it does, in fact, contain sensitive information. It’s common for any kind of automated product, or manual process for that matter, to generate “false positives.” If, upon examination and research, the data set doesn’t actually contain sensitive information, it’s not a vulnerability and need not be further considered for remediation. A good process will include the ability to flag such data sets as false positives, document what was done to confirm the finding, and provide an expiration date so the data set can be revisited at an appropriate future date.
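
As promised above, here is a minimal sketch of the date-based triage behind the first two bullets. It assumes the last-referenced dates were captured (for example, from a catalog or DCOLLECT-style export) before any content scan could reset them; the data set names, dates and thresholds are illustrative only.

```python
# Minimal triage sketch: classify candidate data sets by last-referenced date.
# Assumes the dates were captured BEFORE any content scanning reset them;
# names, dates and thresholds are illustrative, not prescriptive.
from datetime import date, timedelta

RETENTION_LIMIT = timedelta(days=7 * 365)   # assumed retention policy: 7 years
INACTIVITY_LIMIT = timedelta(days=456)      # roughly 15 months, per the guidance above

def triage(datasets, today=None):
    """Return {'delete': [...], 'archive': [...], 'review': [...]}."""
    today = today or date.today()
    buckets = {"delete": [], "archive": [], "review": []}
    for name, last_referenced in datasets:
        age = today - last_referenced
        if age > RETENTION_LIMIT:
            buckets["delete"].append(name)   # delete with erase-on-scratch,
                                             # or encrypt and archive to tape
        elif age > INACTIVITY_LIMIT:
            buckets["archive"].append(name)  # remove access, encrypt, migrate;
                                             # keep cataloged so it stays findable
        else:
            buckets["review"].append(name)   # in current use: validate contents
    return buckets

if __name__ == "__main__":
    sample = [
        ("PROD.TEST.COPY1", date(2004, 6, 30)),    # hypothetical data set names
        ("DEV.CUST.EXTRACT", date(2011, 11, 2)),
        ("QA.CLAIMS.MASTER", date(2012, 12, 15)),
    ]
    for bucket, names in triage(sample, today=date(2013, 3, 1)).items():
        print(bucket, names)
```

The point is simply that, once the dates are preserved, the retention and inactivity decisions reduce to a mechanical filter whose results can be reviewed with the data owners.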

If a “currently used” data set contains sensitive information, the data owner must be consulted to determine if the sensitive information is really necessary based on how it’s being used. If not, it can be cleansed by deleting all occurrences of the sensitive information or masking the values so the data set becomes no longer sensitive. If the data set contains sensitive test data that needs to retain its characteristics to provide valid tests, the sensitive data can be overlaid with similar data that passes the validity checks, but doesn’t really reference the sensitive data. If the data’s sensitive meaning must remain accessible, a tough decision must be made with the data owner about how processes and application programs can be modified to protect the sensitive information to the greatest extent possible and maintain compliance with the applicable regulations and laws.
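
For example, overlaying card numbers in test data with synthetic values of the same length that still pass the Luhn check preserves most format and validity tests without referencing real accounts. The sketch below is illustrative only; keeping the issuer prefix is a design choice, and real de-identification tools add consistency across files, issuer-aware rules and audit trails.

```python
# Minimal masking sketch: overlay a real PAN with a synthetic one of the same
# length that still passes the Luhn check, so format and validity tests keep
# working. Purely illustrative; not a substitute for a de-identification tool.
import random

def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit to append to the given digit string."""
    total = 0
    # Walk right to left over the partial number; double every second digit
    # counting from the position the check digit will occupy.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def mask_pan(pan: str, keep_prefix: int = 6) -> str:
    """Replace a PAN with a same-length, Luhn-valid synthetic number.
    Keeps the first `keep_prefix` digits (issuer range) so routing-style
    edits in test code still behave; everything else is random."""
    body_len = len(pan) - keep_prefix - 1
    body = "".join(random.choice("0123456789") for _ in range(body_len))
    partial = pan[:keep_prefix] + body
    return partial + luhn_check_digit(partial)

if __name__ == "__main__":
    original = "4111111111111111"   # well-known 16-digit test number
    print(mask_pan(original))       # prints a random, same-length, Luhn-valid replacement
```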

Tokens and Encryption

A common approach is to replace the sensitive data with tokens or to encrypt it. Both techniques typically require modifications to the processes and application programs so they can decrypt the data, or look up the token value, before using it.
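
Conceptually, a token vault can be as small as the sketch below: the application keeps only a random token, and an authorized lookup recovers the original value. The in-memory dictionary stands in for what would really be a separately secured, encrypted store; the class name and token format are illustrative, not any particular product’s API.

```python
# Minimal tokenization sketch: swap a sensitive value for a random token and
# keep the mapping in a separate "vault". A real vault would be an access-
# controlled, encrypted store; the in-memory dicts here are only illustrative.
import secrets

class TokenVault:
    def __init__(self):
        self._token_to_value = {}   # token -> original sensitive value
        self._value_to_token = {}   # original value -> token (so repeats reuse a token)

    def tokenize(self, value: str) -> str:
        """Return the existing token for value, or mint a new random one."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "TKN-" + secrets.token_hex(8)   # token format is an arbitrary choice
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; only authorized code paths should call this."""
        return self._token_to_value[token]

if __name__ == "__main__":
    vault = TokenVault()
    token = vault.tokenize("4111111111111111")
    print("stored in application data:", token)
    print("recovered by authorized step:", vault.detokenize(token))
```

Because the token carries no information about the original value, a stolen copy of the application data is useless without access to the vault itself.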

If the processing program can’t be modified easily to encrypt/decrypt/tokenize the individual fields, or even the entire record, as it’s processed, a potential alternative is to modify the batch Job Control Language (JCL) to add a decryption step that writes the data set to a Virtual Input/Output (VIO) data set. The application program then uses the VIO data set for input/update, and the process is reversed when access to the data set is finished.

The VIO data set is preferred over a temporary “scratch” data set because it resides in the paging storage of the system and not in an actual data set on the disk storage system, as a normal temporary data set would be. This means that if the program terminates abnormally, the decrypted copy won’t be left around for someone else to find.
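
JCL itself is beyond the scope of a short example here, but the wrap pattern can be sketched conceptually in Python: decrypt into memory (playing the role of the VIO data set), let the unchanged processing step read the clear copy, and never write clear text to disk. This is an analogy, not the JCL technique itself; it assumes the third-party cryptography package and a key supplied by an external key manager.

```python
# Conceptual sketch of the wrap pattern above, in Python rather than JCL:
# decrypt into memory (standing in for the VIO data set), let the unchanged
# processing step read the clear copy, and never write clear text to disk.
# Assumes the third-party `cryptography` package and a Fernet key supplied
# by an external key manager; names are illustrative.
import io
from cryptography.fernet import Fernet

def process_records(clear_stream):
    """Stand-in for the existing application step that needs clear text."""
    for record in clear_stream.read().splitlines():
        pass  # the real program's logic would run here

def run_with_decrypted_copy(encrypted_path: str, key: bytes) -> None:
    cipher = Fernet(key)
    with open(encrypted_path, "rb") as f:
        clear = cipher.decrypt(f.read())   # the added "decryption step"
    buffer = io.BytesIO(clear)             # in-memory scratch, never on disk
    try:
        process_records(buffer)            # the unmodified application work
    finally:
        buffer.close()                     # even on failure, only the in-memory
                                           # copy existed; nothing to clean off disk
```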

Conclusion

Today’s legal and regulatory environment demands that organizations identify all the locations of their sensitive data, remediate the data that can be deleted or encrypted, and determine whether the access permissions for the remaining sensitive data are appropriate.