Information Retrieval & Fuzzy Evaluation

Next: Conclusion Up: Knowledge & Data Representation Previous: Formal Description

Information Retrieval & Fuzzy Evaluation

As it is expected, information stored in the forms explained in the preceding section, has to be retrieved for various purposes. Query and deletion is two of them. Now the questions is: Given a form of criteria how is the solution set formed? But before answering this question a more essential question has to be posed. What kind of operations are possible in a query?

A query is made over any attribute of any relation in the database.
Three comparison operators are provided (implemented) as:
- `=' The `equality' test.
- `<' The `lesser' test.
- `>' The `greater' test.
The left hand side of the comparison is the attribute name subject to the query, and the right hand side is a constant value. This value will be allowed to be a fuzzy constant if that attribute is a fuzzy one.
The logical combinations of two queries by the operators
- AND
- OR
is also a valid query. In this case the queries joined by these logical operators will be called `sub-queries'.

The possibility that the queries can be over fuzzy attributes and that logical operations are allowed, requires that all the results coming form the sub-queries, and hence the queries, must carry a fuzzy information. Furthermore, some binary fuzzy set operations must be defined as well. Therefore the system is designed such that all the query results are fuzzy sets. The logical AND operations that join two sub-queries will cause an intersection of the result sets coming from these sub-queries. Similarly a OR operation means a union of the result sets of the sub-queries.

A fuzzy set is not much different than a normal set. It is a collection of some entities. The difference lies in the concept of `being an element of the set'. In the classical set theory an entity is either the member of a set or not. In the fuzzy set theory one speaks of the degree of being a member of a set. So, each entity has a `membership value' (which will be abbreviated as MV, from now on) concerning a fuzzy-set which tells how strong it is a member of the set. We will assume (without loss of generality) that this MV is represented by a real value in the range [0,1]. The fuzzy approach incorporates the classical sets. A classical set can be considered as a fuzzy set where all of its members have MVs that are equal to 1. Any entity that is not a member of the set has an MV of 0. On the following pages we introduce the MV values defined by means of functions over the numerical domains of the fuzzy concept (e.g. if the fuzzy concept is `distance' then the domain is distances in kilometers). As said, these functions are mappings into the real numbers in the range [0,1] which are the MV values.

The two basic operations defined on classical sets, namely union () and intersection () have their counterparts in the fuzzy set theory, as well. As expected these definitions for the fuzzy sets reduces to the ordinary union/intersection operations if the sets involved have elements with MVs of 1 only. The fuzzy union of two fuzzy sets A and B is defined as:

Similarly, the fuzzy intersection of two fuzzy sets A and B is defined as:

Of course at implementation time it is nonsense to hold the entities with MV of 0, in any result set. These entities are just discarded.

So, the fuzzy evaluation constitutes of two main parts.

evaluation of the comparison(s)
joining the results of the sub-queries (if there exist more than one) by the appropriate fuzzy set operations union/ intersection.

The first entry should be explained for its various possibilities. The left hand side of the comparison is the attribute name and the right hand side is a constant value. Where the comparison operation is one of `=', `<', `>'. Consider a specific query of this form. If this query is over an attribute that may admit fuzzy values, than it is true that

The user can have entered a fuzzy instance value as well as a non-fuzzy instance value on the RHS.
For each database entry this attribute can have a fuzzy value or a non-fuzzy value.

For simplicity we will call non-fuzzy values as crisp. So, all the four combinations are possible:

The system is so designed that for each entry in the database the attribute value and the query value are submitted to an algorithm which determines the MV value, for this pair. This algorithm acts in three different ways for the following three categories:

Both are crisp.
One is crisp the other is fuzzy.
Both are fuzzy.

Below the behaviors of the algorithm for these categories are described:

Both are crisp :

For the `<' and `>' operators the values are compared, and depending on this result a MV of 1 or 0 is returned.

For the `=' case we have the option of using fuzzy equality. `equality' is interpreted as `being about'. Following the definition in [], a function that defines MV for `about' is introduced:

The above defined `about' function has a bell-like shape around which will be compared to x. c is a constant which is determined ampirically for each fuzzy attribute. This constant serves to adjust the mildness or tolerance of `being about' in other terms this is the sharpness of the peek of the beel-like shape.

MV that stays below TV (TV is the Threshold Value and is used to serve as a comparison value to accept/reject a solution. This concept is explained below in more detail) are taken as 0.

EXAMPLE

As an example let us look at the fuzzy attribute of distance. Our implementation has the following constant setting for distance:

Now assume the query is something like

and the TV was set to 0.75 before hand. Let us look at five different data:

One is crisp the other is fuzzy :

In this case we have a fuzzy value (far, short, tall). For each of these fuzzy values, an empirically determined function is defined.[] These functions are mappings from the numerical value domain (the fuzzy value is about), onto the range [0,1] of real numbers (which stands for the MV values). So, for example, if is the function associated with `shortness', and yields

we would say that the height value: 1.55 meter is a member of the fuzzy set `short' with an MV value of 0.7.

This explains what is done for the `equality' test: The deterministic value is plugged into the specific function associated with the fuzzy value, and the MV returned by the function is used. Now, it is quite possible that some evaluations will yield results which are significantly small, hence a mechanism has to exist to reject this type of MV values. For this purpose a cut-off value (we will call it the `Threshold Value', and abbreviate it by TV) is introduced. Threshold value is a user defined value to reject or accept tuples according to their membership values. Any MV value that cannot exceed the TV will be taken as 0 (zero) and so that specific element will have no chance to exist in the fuzzy result set. This means TV should be less than or equal to MV.

For the `<' and `>' comparisons the evaluation scheme changes. Let us take the `<' comparison first. Assume the query was of the form

(Like: salary<high)

and a candidate value is known. The algorithm will proceed as follows:

Find the area that is under the curve in the range
Find the area that is under the whole curve, i.e. in the range
Calculate and compare this value with the TV. If it exceeds the TV, accept it and return it as the MV value for , otherwise reject this candidate.

The idea behind this can be phrased as:

The resulting MV of the comparison is the sum (integral) of the MV values for that specific fuzzy value that are in the range not satisfying the comparison criteria divided by the whole area under the MV curve.

For the `>' comparison, the process is similar. Assume the query was of the form

(Like: salary>low)

and a candidate value is known. The algorithm will proceed as follows:

Find the area that is under the curve in the range
Find the area that is under the whole curve, i.e. in the range
Calculate and compare this value with the TV, if it exceeds the TV accept it and return it as the MV value for , otherwise reject this candidate.

Let us clarify what we are saying by an example: EXAMPLE

We will be looking at one of the above given query line

Lets have a look at some data of this category available, and list the values of the salary attribute (Ofcourse only deterministic values are listed, the others will be considered in the following part.)

We define

If we integrate analyticaly and we get

Here is exactly the smallest x value where the high function `touches' 1 ie. for the first time. In this example of our implementation . Furthemore the new introduced symbol is a constant and is exactly

Since we will be talking very soon about the `total area under the high(x) curve' we have to clarify the concept of infinity here. Any such function as high(x) is defined to explain some World-phenomena. Hence we shall restrict ourself to the real domain which steams from the actual problem. If we are talking about people that make their living on the base of salaries earned monthly we cannot say that the highest possible value for a salary earned is simply infinity. There is a realistic suppremum for salaries. So we clarify our term `infinity' (in the sense of the upper limit of an integration) to be suppremum salary. In our implementation this was taken to be 40 and is symbolicaly indicated by above. In this sense the `total area under the high-salary curve' is calculated as

Now look at the first sample data. Namely 8, our task is to find the area under the high(x) curve in the range . Bearing in mind the above discussion we conclude that we actualy have to obtain the area for the range [8,40]. Making use of the obtained integral above, we can write

eqnarray1920

This leaves us with the calculation of . Using the above give analytic result we obtain

This value is exactly the area under the high(x) curve that remains to the right of the (x=8) value. To obtain the MV value, following the procedure described above, this value has to be divided by `total area under the curve' which is in our case 31.8652.

This value is compared to the TV (Treshold Value) which is set to 0.75, and since (0.977>0.75) we accept this data to be a solution. Without going into the calculation details we tabulate the results for all the example data.

Both are fuzzy :

All the comparison cases are treated similarly. For each comparison there exist a `comparison matrix' which we invent as a generalized version of the `similarity matrix' concept. A similarity matrix is the way to express two fuzzy values are. For each entry of the Cartesian product of this `possible fuzzy values' set with itself, a real number (MV) in the range [0,1] is associated. Such a mapping can be represented by a matrix (where N is the cardinality of the mentioned set). The two indices of the matrix take their values from this set, namely the possible fuzzy values. (e.g. far, near).

In this work this concept of `assigning an (MV) value for each pair of fuzzy values' which tells us `how equal' these two fuzzy values are has taken further to cover also other comperative operations like `being greater' or `being lesser'. So we have a matrix which tells us `how equal', for example, the distance fuzzy values are. Also we have another matrix that tells us `how greater' the distance fuzzy values are among themselves. The entries of these matrices are precalculated, using a similar technique to the one that is described by []. This value is retrieved from the matrix and compared to the TV. If it exceeds than the candidate is accepted with that MV else rejected.

The procedure to calculate the entries of these matrices (once and for all) differs for the operations `equality' and `being lesser/greater'.

Case 1: Equality

In this case we seek to determine the value

where and are fuzzy values about a domain . For example we may seek to find where we are looking at the height domain. We know that any fuzzy value is introduced by a function which maps some real world values of the domain to the real numbers in the range [0,1]. So each has an associated function

We define tha answer to as

In other words, if , and represent the areas that lay under the functions , and respectively, then

Note that with this definition `equality' is not symmetric, because is usually not equal to

Case 2: Less/Greater

Assume is one of the comparison operators > or <. Now the question is to define

The implemented method distinguishes two cases as:

is true :
In this case, without any hasitation, the MV value is defined as 1.
is false :
Interestingly, in this case MV is a priory defined as 0. Rather it is defined as

It is worthwhile to point out the fact that with this definition the `comparison matrix for <' is the transpose of the `comparison matrix for >'

Study of the functions for the fuzzy values led to the following `similarity' and `comparison' matrices.

HEIGHT (equality)

HEIGHT (greater)

Next: Conclusion Up: Knowledge & Data Representation Previous: Formal Description

Meltem TURHAN
Wed Oct 30 10:42:28 EET 1996