3.5 - Applications of OS Protection

Copyright(c), 1990, 1995 Fred Cohen - All Rights Reserved

OS Protection Models:

In order for a policy to be enforced, it is usually supported by a model that maps the policy into a representation which can be readily understood in terms of an information system. A model usually includes specifications of explicit data structures and techniques for using these structures to enforce policy. It tends to show a correspondence between mathematical policy concepts and enforcement mechanisms. A typical model that depends on a subject/object policy to define allowable information flows usually requires some way of identifying and authenticating the identities of subjects, or the distinctions of policy will not be meaningful. A separation mechanism that keeps track of objects and partitions them from each other is usually required in order to make the policy distinctions between objects meaningful. Some way of implementing the protection mechanism so that it is uncircumventable is necessary in order to assure the integrity of the protection mechanism. Some way to assure that the mechanism is properly implemented is required, or it may fail to properly implement the desired policy.

Keeping track of objects usually involves a separation mechanism that assures that resources are partitioned into non-overlapping time and/or space segments, each of which is initialized on first use, erased upon removal, and marked with identifiers that are only accessible by the OS [Klein83] . Subject identities are generally checked through mechanisms involving some form of authentication. Hardware support is usually depended upon for keeping the OS uncircumventable. Assurance is usually provided by open design and implementation, proof of correctness where necessary, and extensive testing. [Landwehr83]

Protection is generally modeled by an authorization mechanism which grants or denies object rights to subjects. In order for such a mechanism to be effective, we must first be able to establish the identity of a subject in a reliable manner. This is usually done by requiring the subject to authenticate its identity to the OS.

Identification and Authentication

Identification and authentication is often characterized as being a function of what the user knows, what the user has, and/or what the user is [Abrams77] . Historically, passwords have dominated this field, but more recently there has been some interest in the use of algorithmic authentication [Cohen85] .

In the classical authentication scheme, each user uses a fixed string to authenticate their identification to the OS. At 'login', the OS asks for the user's identification and authentication string, and the user enters it. The computer optionally encrypts the password, and checks it against the OS copy. If the strings agree, the user's identity is assumed to be authentic. A major problem with this scheme is that it is easy to write a 'password' checking program that quickly tries selected passwords.

Many system authentication programs have been designed to stop the authentication process when they reach the first wrong character of the password. Attackers have exploited this error in design by placing the first character of a proposed password on a different page of memory from the rest of the password, and determining whether or not the second page was ever accessed by checking paging statistics. Once the first character is attained, the password is realigned so the first and second characters are on the first page, and so on. This makes the attack time linear in the number of symbols in the password and the number of different characters per symbol. The cure is to force all authentications to perform the same number of equally timed operations before returning either a successful or failed response.
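The cure can be sketched as follows; this is a minimal Python illustration using the standard library's constant-time comparison, not the code of any particular OS:

```python
import hmac

def check_password(supplied: str, stored: str) -> bool:
    # hmac.compare_digest examines its arguments in time that does not
    # depend on where the first mismatch occurs, so neither early-exit
    # timing nor paging behavior reveals correct leading characters.
    return hmac.compare_digest(supplied.encode(), stored.encode())
```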

Assuming the password authentication program is properly designed, the number of attempts required to find a password by exhaustive search averages 1/2 the number of possible passwords (given that the password is chosen at random from all possible passwords). Given that the password is of length 'l', and is made of symbols from a symbol set containing 's' symbols, the number of attempts will average (s**l)/2. For an average time per trial of 't' time units, the time till entry is (ts**l)/2.
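The formula is direct to compute; a small sketch:

```python
def average_attack_time(s: int, l: int, t: float) -> float:
    """Average exhaustive-search time (t * s**l) / 2 for a password of
    length l over an s-symbol alphabet, at t seconds per trial."""
    return t * s ** l / 2

# 4-digit PIN at 10 seconds per trial: 50,000 seconds, about 14 hours
pin_time = average_attack_time(10, 4, 10)
```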

For short passwords and trial times, this can be very quick. For example, if a password is made of 4 digits (e.g. the 'PIN' associated with most current automated bank teller systems), the average number of attempts is only 5,000. Since an attempt takes only about 10 seconds, we expect that a bank account could be broken into after about 50,000 seconds, or roughly 14 hours. In fact, banking machines allow only about 8 attempts before invalidating a card, so it takes over 600 cards on the average before a successful attack can be launched. Many ATM users write their PINs on the card to help them remember, just as many computer users write down their passwords. This of course makes attack quite trivial. In most computer systems, one attempt can be performed (by a program on the system) in under .1 seconds. This makes numerical passwords of length 6 attackable in under 14 hours. Most password schemes allow numbers and letters, which increases the complexity of attack considerably.

Another major problem is that users tend to associate their passwords with some aspect of their life, use common words from the dictionary, or use very short sequences of characters. This is very easily attacked through the use of selected plaintext, and the attack times for these cases are easily determined. For passwords using common dictionary words (or their concatenation), the number of possible combinations of a given length is considerably smaller than in cases where any sequence of characters may occur. Most computer systems currently in use have on line dictionaries for spelling correction and other such purposes. For a dictionary containing 'w' words, and passwords made up of at most 'k' words from that dictionary, the number of tries required for an average attack comes to only (w**k)/2, and the attack time is (tw**k)/2.
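The dictionary-based attack time is computed the same way, with the dictionary size in place of the symbol set; a sketch (figures illustrative):

```python
def dictionary_attack_seconds(w: int, k: int, t: float) -> float:
    """Average time (t * w**k) / 2 to attack passwords built from
    k words drawn from a w-word dictionary, at t seconds per trial."""
    return t * w ** k / 2

# a 10,000-word vocabulary, two-word passwords, .1 seconds per trial:
# about 58 days -- versus centuries for equally long random strings
two_word = dictionary_attack_seconds(10000, 2, 0.1)
```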

On the average, people use only about a 10,000 word vocabulary, and for passwords of length 8 or less, the average number of words per password is only about 1.5. Using .1 seconds per attack as an estimate, the attack time comes to under 14 hours, compared to the average attack time for 8 character passwords using random characters which is over 3,300 years. In a study done using just such a program at Berkeley [Ritche82] , the average attack times were found to be about 3 hours on a typical Unix system.

In cases where personal data about the user is available and personal association is used in the creation of the password, a considerably smaller dictionary results. The effect is a drastic reduction in attack times. In the recent attack launched through the UCLA computers, the system administrator was able to guess the password of an attacked user in only a few minutes after examining personal information about the user [Reid83] .

In a study performed by the author, about 1 in 10 users were found to have trivially guessed passwords (first or last name spelled forwards or backwards) even though all users were warned in writing before receiving their accounts to use hard to guess passwords. Another study [Highland86] indicates that 21% of users have 1 character passwords, and 92% of users have passwords of 4 or fewer characters. Assuming this is a typical environment, it is clear that many computer systems could be infiltrated to a high degree in a very short time. The real attacker is also unlikely to be tracked down, because an audit would identify only the users whose accounts had been illicitly accessed, not the attacker.

Some simple protection mechanisms are available to limit the usefulness of the plaintext authentication attack. It is simple to write a program that checks the length of passwords to verify that they are of at least a minimal length and not a combination of dictionary words in the computer's on line dictionary. In essence, this is an automatic way to verify that the user doesn't act foolishly in the choice of passwords. The algorithm which verifies passwords can also be made to have long run times by using a complex one way encryption and adding delays to the system. This serves to increase attack times, but as delays become significant, they become unacceptable to the users of most systems.
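Such an automatic check might be sketched as follows (Python; the minimum length and the dictionary contents are hypothetical policy choices):

```python
MIN_LENGTH = 8  # hypothetical site policy

def acceptable(password: str, dictionary: set) -> bool:
    """Reject passwords that are too short, are a dictionary word, or
    are a concatenation of two words from the on line dictionary."""
    if len(password) < MIN_LENGTH:
        return False
    lowered = password.lower()
    if lowered in dictionary:
        return False
    # reject concatenations of two dictionary words
    for i in range(1, len(lowered)):
        if lowered[:i] in dictionary and lowered[i:] in dictionary:
            return False
    return True
```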

This simple password augmentation improves protection against simple plaintext attack to a significant degree, but it doesn't protect against other attacks like the observation of the password. Another problem is that the same authentication string is usually used throughout the user's authorization space so that any authorized activity is granted through a single initial authentication. By requiring unique authentications for various authorizations, it is possible to force an attacker to observe many authentications before attaining access to all authorized operations of a given subject.

Multiple authentication increases the time over which an attacker must observe the user before attaining full access and has the advantage that over a given session the subject may only sacrifice a limited subset of privileges to the attacker. If a user's authorized resources are divided into P partitions such that each requires a different authentication, then it takes at least P observations to gain access to all authorized resources. This also provides limited protection from attacks where the attacker takes control of the user's I/O device(s) after authentication has been performed, since only a subset (1/P'th) of the user's authorization is granted at any given time.

A problem with all of the above authentication schemes is the simplicity of a 'Trojan horse' attack. By writing a program that observes input and simulates the authentication process, authentication data can be gathered and directly used to gain access. Similarly, a passive tapper can observe these processes and reuse the data. There are two basic properties of this scheme that make it vulnerable to attack:

P1 - The use of static data for authentication.
P2 - The unidirectionality of the authentication process.

The algorithmic authentication scheme presented here [Cohen85] avoids both P1 and P2, and thus avoids most of the problems with current authentication systems. An 'algorithmic authentication scheme' is a scheme where the authentication of a user to a system and a system to a user is based on their knowledge of an algorithm rather than their memory of data. The plaintext of the 'key' is never transmitted, and both parties can authenticate the others' identity.

1 - The user (u) and the computer (k) share a private transform T
2 - u and k generate pseudo-random bit sequences (Bu and Bk)
3 - u sends Bu to k, and k sends Bk to u
4 - u and k perform T(Bk) and T(Bu) to get Zu and Zk
5 - u sends a subsequence of Zu to k, and k sends a subsequence of Zk to u
6 - u and k verify the correctness of Zk and Zu respectively
7 - If 6 is correct, 5 and 6 are repeated until u and k are both satisfied
8 - If 6 fails, the failure detector sends random numbers and denies access
Detailed analysis of this and closely related schemes appear in [Cohen85] .
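The steps above can be sketched as follows. HMAC-SHA256 stands in for the private transform T purely for illustration; [Cohen85] does not mandate any particular transform, and the random-number behavior of the failure detector in step 8 is omitted:

```python
import hashlib
import hmac
import secrets

class Party:
    """One side of the algorithmic authentication exchange.  A keyed
    HMAC stands in for the shared private transform T; the scheme only
    requires that both parties can compute the same T."""
    def __init__(self, shared_key: bytes):
        self.key = shared_key

    def challenge(self) -> bytes:
        # step 2: generate a pseudo-random bit sequence
        return secrets.token_bytes(16)

    def transform(self, challenge: bytes) -> bytes:
        # step 4: apply the private transform T to the peer's sequence
        return hmac.new(self.key, challenge, hashlib.sha256).digest()

    def respond(self, challenge: bytes, lo: int, hi: int) -> bytes:
        # step 5: reveal only a subsequence of T(challenge)
        return self.transform(challenge)[lo:hi]

    def verify(self, challenge: bytes, reply: bytes, lo: int, hi: int) -> bool:
        # step 6: check the peer's subsequence against our own computation
        return hmac.compare_digest(self.transform(challenge)[lo:hi], reply)

# mutual authentication: u and k each challenge the other; the
# plaintext of the key is never transmitted
key = secrets.token_bytes(32)
u, k = Party(key), Party(key)
bu, bk = u.challenge(), k.challenge()             # step 3: exchange Bu, Bk
assert k.verify(bk, u.respond(bk, 0, 8), 0, 8)    # k authenticates u
assert u.verify(bu, k.respond(bu, 8, 16), 8, 16)  # u authenticates k
```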

Authorization

The association of rights to authenticated identifications grants authority to subjects for operations on objects. The process of granting authority is called authorization. There are several standard authorization methods used primarily to control information flow.

The most common method is the combination of a separation mechanism, an access control structure, and a means for checking each access against the structure. Separation mechanisms are usually implemented by hardware which prohibits physical access to resources except by processes operating in the secure state. Most OSs demand that the only processes running in the secure state are those which are part of the OS itself. When there are multiple levels of secure states within the hardware, the most privileged is usually responsible for the protection mechanism. The means for checking each access and the access control structure are generally kept in the most secure state because of their criticality to secure operation, while other OS code is generally excluded from this state in order to assure that only the access control mechanisms can observe or modify themselves and their structures. Thus all accesses to system resources and the protection mechanism are controlled by the protection mechanism, and the mechanism is uncircumventable except through its own failings.

The Bell-LaPadula [Bell73] , Biba [Biba77] , and Compartment [Klein83] policies are generally modeled in secure system data structures by keeping security, integrity, and compartment markings for each subject and object in the system, and implementing a provably correct routine that grants or refuses system calls on the basis of the markings. Another common technique is the use of lists of subjects with access to various objects (i.e. access control matrices or access lists). Lattices and POsets are fairly straightforwardly modeled with matrices [Denning75] [Cohen86] .
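A minimal sketch of such a marking check, reduced to Bell-LaPadula security levels alone (compartment and integrity markings omitted):

```python
def blp_allows(op: str, subj_level: int, obj_level: int) -> bool:
    """Bell-LaPadula check on security markings: read requires
    subject >= object ('no read up'); write requires
    subject <= object ('no write down')."""
    if op == "read":
        return subj_level >= obj_level
    if op == "write":
        return subj_level <= obj_level
    raise ValueError(op)
```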

A relatively recent variation on the subject/object model of protection was used to describe and demonstrate properties of the POset protection policy. In this model, subjects are not explicitly differentiated from objects. Rather, there is a set of 'information domains' (i.e. domains) and a flow relation 'f' which specifies the flow between domains. Instead of a subject/object matrix, we have a flow control matrix in which the flow right either exists or doesn't exist from domain to domain. The POset policy is any set of flows that can be specified in an upper triangular matrix [Cohen86] under this model. Subjects are dealt with as collusions of domains, and a collusion theory, which is a special case of the more general theory, is used to analyze the effects of subjects and collusions thereof.
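Under this model, checking that a flow control matrix specifies a POset policy reduces to checking triangularity. The sketch below assumes the domains have already been numbered in an order compatible with the flows:

```python
def is_poset_policy(F) -> bool:
    """F[i][j] == 1 iff information may flow from domain i to domain j.
    With domains numbered compatibly, a POset policy is one whose flow
    control matrix is upper triangular: no flow goes 'backwards'
    (entries strictly below the diagonal are all zero)."""
    n = len(F)
    return all(F[i][j] == 0 for i in range(n) for j in range(n) if j < i)
```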

An interesting alternative to access matrices is capabilities [Dennis66] . Capabilities are unforgeable tokens that may be exchanged for rights to perform operations. As an example, the subway token (if properly implemented) is an unforgeable token that may be exchanged for the right to enter the subway system. In an information system, these tokens are usually informational in makeup, and their unforgeability is typically guaranteed by the use of cryptographic techniques. A fundamental problem with capability based systems is the fact that they are information in the hands of the user. An attack against such a system is demonstrated by the following scenario.

Suppose that Alice has a capability C for reading or writing information at a given B-L security level 1, and Bob has a capability D for reading information at level 1 or level 2, but only writing information at level 2. Alice has C, and can therefore write a level 1 data item which contains a copy of C. Bob can read the data item containing C with capability D, and thus attain a copy of the token for capability C, which then allows Bob to write level 1 information. Thus the capability system has to be modified to have the OS enforce the maintenance and use of capabilities, which degenerates capabilities into a case of access lists, except that the capability system is based on an 'irrevocable' right to perform operations (because the capability is granted by virtue of holding the token, not by virtue of a system held entity). Revocation of rights presents a number of problems, because it is not feasible to reverse the time effects of information flow, and because once held, a token cannot be reliably taken away.
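The informational nature of capabilities can be sketched as follows. The HMAC construction and all names here are illustrative assumptions, not any particular system's design; the point is that although the token is unforgeable without the system key, it is still just data, and whoever holds the bytes holds the right:

```python
import hashlib
import hmac

SYSTEM_KEY = b"known-only-to-the-OS"   # hypothetical system secret

def mint(obj: str, right: str) -> bytes:
    """Mint a capability token for a right over an object.  The HMAC
    makes the token unforgeable without SYSTEM_KEY, but the token is
    just data."""
    return hmac.new(SYSTEM_KEY, f"{obj}:{right}".encode(),
                    hashlib.sha256).digest()

def honors(token: bytes, obj: str, right: str) -> bool:
    # the OS grants the operation to any presenter of a valid token
    return hmac.compare_digest(token, mint(obj, right))

# Alice's write capability, once copied into a data item Bob can read,
# works just as well when Bob presents it:
alices_token = mint("level1-file", "write")
bobs_copy = bytes(alices_token)
```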

In many cases, policies allow a certain degree of user control over access (i.e. discretionary access control (DAC)), in which certain subjects have rights to grant or take certain rights over certain objects to certain subjects. This is usually implemented by the addition of an 'ownership' right which grants the owner DAC rights if and only if the mandatory access control (MAC) rights don't restrict the access. In most modern computer systems, only DAC rights are implemented.

There are no well defined mathematical models for service availability at this time other than those defined by the field of fault tolerant computing wherein a given availability can be attained for an entire system [Siewiorek82] . Several policies in this area have been explored to a limited extent, but no well accepted one has yet been proposed.

Implementations:

Although it is possible to properly implement a protection mechanism in any general purpose computer system, it is very difficult if there are no support mechanisms which provide a separation between the enforcement mechanism and applications. Most systems, a notable exception being most personal computers, provide hardware support for separating the OS from applications (see references in [Landwehr83] ). They are typically designed to assure that accidental protection violations don't occur, since these can cause substantial harm (typically destruction of the file system, denial of services, or unpredictable behavior), and are often able to withstand serious attacks if the OS is properly designed and implemented.

Databases as Limited Functionality Operating Systems:

Databases are essentially special purpose operating systems that are designed to operate with limited functionality. Aggregate information flow is often allowed (i.e. aggregate results of computations may be made available) even though the basis for the inference of aggregated results must be kept from flowing freely. The problem of allowing aggregation while preventing access to portions of the basis is called 'inference control', and is very similar to the limitation of access to 'intelligence indicators', in that the release of some aggregate information may indicate to an attacker, through indirect means, information that is not supposed to be revealed. As a typical example of inference from aggregate information, suppose we allow a database user to know the total salary of all employees, but not the individual salaries of any employees. When a new employee is added, the new total of all salaries indicates the new employee's salary directly. Much more subtle inferences are possible, and in general, the ability to perform the NOR function (or any other set of functions that forms a complete set of relations on a set) is sufficient to eventually reveal the basis of inferred data. Many good references are available in the area of database protection, and these are well summarized in [Denning82] and [Fernandez81] .
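The salary example can be made concrete in a few lines (names and figures hypothetical):

```python
# confidential basis: individual salaries are never released
salaries = {"ann": 50000, "bob": 60000}
total_before = sum(salaries.values())   # released aggregate

salaries["carol"] = 55000               # a new employee is added
total_after = sum(salaries.values())    # new released aggregate

# two nonsensitive aggregates reveal the sensitive value exactly
inferred_new_salary = total_after - total_before
```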

From the standpoint of policy, databases are essentially the same as operating systems. The major difference lies in the limited functionality often afforded by databases. As we noted earlier, there are three 'laws' of information in general purpose, transitive information systems with sharing. In the case of databases, we generally don't have a general purpose system, and thus these laws don't necessarily apply. We may be able to limit transitivity or sharing with relatively easy to compute methods. We may be able to enforce a conservation policy for information that causes one subject to lose information when another gains it, and thus we may be able to use conservation based traces of information flow.

A database may be defined as a centralized collection of mutually related operational data stored more or less permanently in a computer, and used in one or more applications by some particular enterprise. This definition is a combination of the definitions offered by [Date75] [Ullman82] , [Wiederhold77] , and [Katzan75] . Since each of these writers offered a slightly different emphasis, we felt it important to formulate a comprehensive definition. A data model is the "...information content of the database as it is viewed by the users." [Date75] (p11) The users very probably think that the data model is itself the database, and with hierarchical databases this is actually the case. A data submodel is some specific subset of the data model. A payroll organization, for example, may believe that the database consists of an employee name, social security number, address, salary, tax information, and deductions, when in reality it contains more information (e.g. employee performance ratings and history).

Defining data submodels creates a kind of natural security. A user can access only the information that has been included in the submodel, and may not even know that any other data exists. It is also possible to define allowable operations within a submodel so that the Data Base Administrator (DBA) can use submodels to control what information a user can access and what operations can be performed on that information. An 'arbiter' is a mechanism that checks each user request and either grants or denies it depending upon the constraints that have been set for the user in question.

There are 3 traditionally accepted types of data models:

1 - relational data models,
2 - hierarchical data models, and
3 - network data models

In practice it is simple to think of a relational database as a table that adheres to the following rules [Date75] (pp43-44):

1 - No two rows are identical.
2 - The ordering of rows is unimportant.
3 - The ordering of columns is important.
4 - Every value is atomic.

In order for data to be of use, it must be current and meaningful; it must be accessible to those users with the need and the right to use it; and it must be protected from unauthorized access. To provide a means to meet this need, database sublanguages have been set up to perform certain operations. A properly authorized user may have to: read stored data, insert new data, delete old data, update data when a portion of it changes, and drop an entire data model when it is no longer needed. While each data model has its own way of dealing with these operations, each makes an attempt to both make data accessible to the properly authorized user, and protect data from users with no right to operate on it.

Query languages have been developed for use with relational databases and are important tools in adding controls to data submodels. The 'query by example' (QBE) language provides a means to authorize specific operations on specific fields, and it can even limit access to a particular value of a field: an "I.AUTR()." authorization entry allows the user named in the entry to perform the operations listed in it. The operations available are (I)nsert, (D)elete, (U)pdate, and (P)rint. Placing a variable in a column allows the named user to perform the stated operation(s) on that column. A constant placed in a particular column grants that right to only the rows containing that constant. A blank in any column denies access to the information in it.

"A hierarchical model uses the nest concept as its basic unit. Trees are created by nesting nests further, and forests are created by collecting trees. The data model and the database model are identical" [Wiederhold77] (p366). The hierarchical data model is often referred to as a physical database. In IMS, the IBM hierarchical database language, each physical database has a 'Data Base Descriptor' (DBD) which describes the tree structure of the database and other pertinent information like the size of fields and access methods (e.g. ISAM, HISAM, etc). Accompanying a physical database are one or more logical databases. A logical database is equivalent to a submodel; in place of a DBD it has a 'Program Communication Block' (PCB). The PCB contains only those segments that are available to a particular user (SENSEGS), and it specifies what operations that user is allowed to perform (PROCOPTS). In short, it has the same protection features that are found in a relational data submodel. The addition of on line capabilities to hierarchical sublanguages (IMS in particular) allows us to restrict database access to particular input and output devices (lterms). It also makes it possible to perform identification, authentication, and authorization tests at the time of a request.

Other parallels can be drawn between security measures for hierarchical databases and those for relational databases. We have already seen that although QBE and IMS use different approaches, they accomplish essentially the same result by hiding nonessential data and specifying allowable operations. Hierarchical databases can also use the same access constraints as relational databases, and arbiter programs are an essential part of on line access controls.

The 'Data Base Task Group' (DBTG) of the 'Committee on Data Systems Languages' (CODASYL) is largely responsible for network structured databases. The schema or data model of a network database is similar to the data model of a hierarchical database, but where the tree structure of the hierarchical database limits a record to having one parent, a record in a network database may have multiple parents with the record forming a link between its parents. The 'schema' of a network database structure corresponds to the DBD of the hierarchical database. The subschema of a network database performs the same functions as the PCB of the hierarchical database, but it has some additional features. Fields in a subschema can be renamed so that it becomes more difficult to link them to the schema from which they originated. 'Privacy locks' and 'privacy keys' are another additional feature. If a privacy lock is specified in a schema or subschema, its accompanying key must be included in any programs that attempt to access data from that subschema or the access will not be allowed. In effect, the network database uses its schema, subschema and 'Data Manipulation Language' (DML) to perform access constraint functions and to serve as its own arbiter program.

In all the above cases the purpose of the database was to store data that is to be accessed directly by the user. This situation does not usually occur with statistical databases. A statistical database can be relational, hierarchical, or network [Denning82] , but is different in the way it's used. The data in most databases is released in its stored form; the information stored in a statistical database is used to calculate statistics which are released to some set of users. The stored data remains confidential. This confidential data and whatever external knowledge a user might have combine to make up the information state of the statistical database. In order to calculate statistics on information stored in a database, it is necessary to use a characteristic formula (i.e. a logical formula using attributes from the database and logical operations). The information retrieved in answer to a characteristic formula is called a query set.

Since stored data is to remain confidential, the issue of sensitive statistics becomes an important one. A sensitive statistic is one which reveals too much confidential information about a specific person, group, etc. If the query set size is 1, the statistic will always be sensitive since the query set will contain the confidential, stored data. A query set of size 2 may also be sensitive, since external knowledge about one member of the set can be used to infer specific information about the other member. If query sets have overlaps, sets of overlaps can be inferred and used to infer smaller and smaller subsets of the database, eventually revealing the basis for the statistical data.

One criterion for determining the safe size of a query set is the 'n-respondent, k%-dominance rule' established by the U.S. Census Bureau. This rule says that a statistic is sensitive if n or fewer members make up more than k percent of the query set. Statistical disclosure (also termed residual disclosure, personal disclosure, or compromise) occurs whenever a user is able to combine external knowledge with the set of released statistics to obtain restricted information. If disclosure occurs and the user has no supplementary knowledge, it is called resultant disclosure; while disclosure with supplementary knowledge is called external disclosure. Exact disclosure means that the user can obtain specific, accurate, confidential information. Approximate disclosure occurs when the disclosure is not exact.
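A sketch of the rule, treating a statistic's query set as the list of individual contributions to it:

```python
def sensitive(contributions, n: int, k: float) -> bool:
    """n-respondent, k%-dominance rule: the statistic is sensitive if
    the n largest contributors account for more than k percent of the
    total."""
    top = sorted(contributions, reverse=True)[:n]
    return sum(top) > (k / 100) * sum(contributions)
```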

If a database is to provide perfect secrecy, none of its released statistics can be sensitive. Since all statistics contain some information about the data, perfect secrecy can never be attained (unless no information from the database is ever released). A database can be called secure if none of the statistics that are disclosed are sensitive. If the set of disclosed statistics equals the set of nonsensitive statistics, the database is said to be precise.

Statistics have traditionally been released in two ways: macro statistics and micro statistics. Macro statistics are tables containing counts for the (row,column) pairs, with each column footed with the sum of the entries in that column. The problem with macro statistics is that they reveal an extremely limited subset of the available statistics. Micro statistics, on the other hand, are the individual data records. They are usually read from the database and stored on tape or disk so that they can be used by programs to compute statistics for release. Unlike macro statistics they are not ready for release prior to being accessed by some intermediate process, since this would expose confidential information.

By having external knowledge a user can compromise a statistic by creating characteristic formulae that have small query sets and contain the information to which the user has access. By eliminating what he already knows, he can gain additional confidential information. In order to more fully demonstrate the problems involved with maintaining a secure statistical database, Denning describes five methods of attacking them [Denning82] . These methods are small and large query set attacks, tracker attacks, linear system attacks, median attacks, and insertion and deletion attacks.

Large query sets become a problem when the query language permits complementation. If a user has knowledge about one subject in the database, he can obtain knowledge about other subjects by complementing a characteristic formula composed of his external knowledge and the subject that he is trying to compromise.

To help protect against large and small query set attacks, query-set-size controls have been established. For each database some number n should be found so that no query sets are permitted with fewer than n or more than N-n entries (where N is the size of the database). Note that n cannot be greater than N/2 or no statistics will be released.
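The control itself is a one-line test:

```python
def permitted(query_set_size: int, n: int, N: int) -> bool:
    """Query-set-size control: release a statistic only when its query
    set holds at least n and at most N - n of the N database records.
    (If n > N/2 the two bounds cross and nothing is ever released.)"""
    return n <= query_set_size <= N - n
```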

Tracker attacks involve using external knowledge and characteristic formulae whose query set sizes are within specifically defined ranges. Individual, general, double, and union trackers are used to circumvent query-set size controls. Once the user has found a tracker, he can use it to obtain additional restricted information.
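A toy individual-tracker attack, using a made-up six-record database. Both released counts have query sets large enough to pass a size control, yet combining them isolates a single record via count(C) = count(C or T) + count(C or not T) - N:

```python
# toy database: each record is (name, dept, salary_over_50k)
db = [("ann", "sales", True), ("bob", "sales", False),
      ("cal", "hr", True), ("dee", "hr", False),
      ("eve", "hr", True), ("fay", "it", False)]

def count(pred):
    """Released statistic: the size of a characteristic formula's query set."""
    return sum(1 for r in db if pred(r))

# C identifies one person, so count(C) itself would be blocked by a
# query-set-size control.  T is a tracker with a mid-sized query set.
C = lambda r: r[0] == "ann" and r[2]
T = lambda r: r[1] == "hr"

N = len(db)
# both released query sets have 3-4 records, yet their combination
# reveals the blocked count exactly
inferred = count(lambda r: C(r) or T(r)) + count(lambda r: C(r) or not T(r)) - N
```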

Linear system attacks are perpetrated by transposing a query set into a system of equations. Solving the system for a specific entry can compromise the information for that entry.

Median attacks pose multiple queries that each use some external knowledge and the median statistic. To use this attack it is necessary to find two query sets that have the same median and one record in common. These sets can be found by repeatedly retrieving median query sets.

Insertion and deletion attacks are damaging to dynamic databases. Query-size controls can be circumvented by adding records that satisfy a particular characteristic formula. Newly added records can be compromised by retrieving a query set immediately before and immediately after the insertion. The difference between the two query sets can be attributed to the newly added record. Any record already existing in a query set can be compromised by observing changes to some statistic when pairs of records are added to or deleted from the query set.

In order to afford protection for confidential information, it may be necessary to apply some restrictions to the statistics themselves. One technique for accomplishing this restriction is cell suppression. Statistics are released in two-dimensional tables. Through cell suppression, all sensitive statistics are eliminated from the tables, and in order to be sure that the waters are sufficiently muddied, an additional group of nonsensitive statistics, called complementary suppressions, is eliminated as well. Another way statistics can be compromised is through implied queries. Implied query sets are query sets that can be derived from known query sets. These implied query sets may be sensitive even if the query sets from which they have been derived are not. To prevent disclosure of confidential information it is necessary to restrict nonsensitive statistics if their query sets or implied query sets are outside of the range imposed by the query-set-size controls previously described.

Another perturbation technique involves partitioning the database into disjoint groups such that:

1 - Each group G has g = |G| records,
        where g = 0 or g >= n and g is even.
2 - Records are added to or deleted from G in pairs.
3 - Query sets must include entire groups.
If the query set for a statistic includes one or more records from each of m groups G(1),...,G(m), then the query set 'q(G(1)+...+G(m))' is released [Denning82] (p369).

Confidential data can also be protected by adding noise to statistics. One way to add noise is by response perturbation which involves applying some function (usually rounding) to a statistic before it is released. Random sample queries protect data by not allowing the user to control what records will be queried, and guarantee that different records will be included in each statistic that is released. Data perturbation is achieved by applying some function to the data values themselves. The stored data can be perturbed data, or it can be stored in its original form and the perturbation applied prior to its use in calculating statistics. Data swapping involves exchanging values among the records so that information cannot be deduced about any particular record.

The randomized response technique can be used when data is gathered. With this technique the person surveyed answers pairs of questions without revealing which answer goes with which question. One of the answers must be one whose probability can easily be calculated when the data is correlated. The remaining answers are attributed to the sensitive question. While this technique is not completely accurate, it may be more accurate than if the sensitive question had been asked directly, since people sometimes feel embarrassed to admit the truth about sensitive issues.
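One variant of the technique can be simulated as follows (parameters hypothetical): each respondent answers the sensitive question truthfully with probability p and otherwise gives a uniformly random answer; since the randomization probabilities are known, the analyst can invert them to estimate the aggregate without learning any individual's true answer:

```python
import random

def survey(true_answers, p_truth=0.75, seed=1):
    """Each respondent answers truthfully with probability p_truth and
    otherwise answers yes/no at random; no single response reveals
    that respondent's true answer."""
    rng = random.Random(seed)
    out = []
    for truth in true_answers:
        if rng.random() < p_truth:
            out.append(truth)
        else:
            out.append(rng.random() < 0.5)
    return out

def estimate(responses, p_truth=0.75):
    """E[yes rate] = p*pi + (1 - p)/2, so invert for pi, the true
    proportion of 'yes' answers in the population."""
    yes_rate = sum(responses) / len(responses)
    return (yes_rate - (1 - p_truth) / 2) / p_truth
```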

In addition to database specific considerations, access control decisions are also important to the protection of information stored in databases. Denning points out that such decisions may be: data dependent, where certain users may see only those records which contain specified values in specified fields; time dependent, where some records may be accessible only during certain hours; context dependent, where a user may be allowed two separate sets of information, but may not be allowed to see the relation between the two sets; and history dependent, where current capabilities depend on previous states of the system [Denning82] .