On the Use of Information Theory to Bound the Effectiveness of Unsupervised Software Retrieval
Theoretically, software requirements can be transformed into multiple forms of information, such as source code, test cases, or design artifacts. We refer to those requirements, or any initial/raw form of information, as the source artifacts. Conversely, information that is the product of a transformation or alteration is considered a target artifact. In the software engineering context, a transformation can be any action or intervention that a software engineer applies to the requirements. The "transmission of information" can therefore be seen as the programming task itself, or any generative process that produces the target artifacts.
The main idea of combining Information Measures with Information Retrieval (NATHAN: need to specify the type of IR this formulation works for and which it does not, i.e., works for traceability, but not for coupling/impact set recovery) is to determine and quantify bounds on the effectiveness of Unsupervised Techniques. In other words, we want to demonstrate that the data do not speak for themselves: information retrieval techniques alone are insufficient to produce reliable traceability links (or any other relationships among artifacts). Moreover, we can determine to what extent information is lost during the transmission process (requirements to source code), as well as to what extent there is information noise in the target artifacts. Information noise may be relevant for security purposes (e.g., why a given piece of source code is not covered by the requirements).
Information can be transmitted by copy or by transformation. In software systems, the information in the requirements X is transmitted by a transformation f(X), where f is any generative process that consumes the source information and produces an information outcome Y. The outcomes are software artifacts produced by that generative process, such as source code, test cases, or design diagrams.
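As a toy illustration of transmission by transformation, consider a hypothetical generative step f that turns requirement tokens into code-like identifiers. The function and its token mapping are made up for this sketch; they stand in for whatever real process (a developer, a generator) produces the target artifact:

```python
# Toy "channel": a hypothetical transformation f from source (X) to target (Y).
def f(requirement_tokens):
    """Map requirement tokens to code-like identifiers (illustrative only)."""
    mapping = {"login": "authenticate_user", "password": "pwd_hash"}
    # Some source information is dropped along the way ("shall" never
    # reaches the target): this is the intuition behind information loss.
    produced = [mapping.get(t, t) for t in requirement_tokens if t != "shall"]
    # Some target information was never in the source ("logger" is an
    # implementation detail): this is the intuition behind information noise.
    produced.append("logger")
    return produced

X = ["user", "shall", "login", "with", "password"]  # source artifact tokens
Y = f(X)                                            # target artifact tokens
```

The dropped and added tokens preview the Loss and Noise measures defined below.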
Five information measures complete the information manifold:
- Self-Information of the source artifacts, I(X). This is the entropy of a set of source artifacts (e.g., requirements).
- Self-Information of the target artifacts, I(Y). This is the entropy of a set of target artifacts (e.g., source code).
- Mutual Information, I(X;Y). This is the amount of information that X reveals about Y and, symmetrically, that Y reveals about X.
- Loss, I(X|Y). This is the amount of information that enters the "channel" but never comes out (i.e., is never depicted in the target). In other words, it is what was in the requirements but was not implemented in the system.
- Noise, I(Y|X). This is the amount of information that comes out (i.e., is found in the target) but never went in (i.e., was never depicted in the source). In other words, it is additional functionality implemented in the system that was not in the requirements.
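Under simplifying assumptions, the five measures can be computed from a joint token distribution p(x, y) between a source and a target artifact. The joint table below is invented for illustration; only the identities among the measures (mutual information and the chain-rule decompositions of loss and noise) come from standard information theory:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Hypothetical joint distribution p(x, y) over source tokens (rows)
# and target tokens (columns); the entries sum to 1.
joint = [
    [0.25, 0.05],
    [0.05, 0.25],
    [0.10, 0.30],
]

px = [sum(row) for row in joint]        # marginal p(x)
py = [sum(col) for col in zip(*joint)]  # marginal p(y)

H_X = entropy(px)                                   # I(X): self-information of source
H_Y = entropy(py)                                   # I(Y): self-information of target
H_XY = entropy([p for row in joint for p in row])   # joint entropy H(X, Y)

MI = H_X + H_Y - H_XY     # I(X;Y): mutual information
loss = H_XY - H_Y         # I(X|Y): information in X never depicted in Y
noise = H_XY - H_X        # I(Y|X): information in Y never depicted in X
```

Note that loss + MI recovers I(X) and noise + MI recovers I(Y), which is what makes these five measures a complete decomposition of the channel.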
Additionally, we introduce the concept of "Minimum Shared Information" for entropy and extropy.
- Minimum Shared Information for Entropy (MSI-I). This is the minimum number of tokens shared between the source and target sets, represented as entropy.
- Minimum Shared Information for Extropy (MSI-X). This is the minimum number of tokens shared between the source and target sets, represented as extropy.
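One plausible reading of these two measures is sketched below: take the multiset intersection of source and target tokens (the minimum count of each common token), normalize it into a distribution, and report its entropy (MSI-I) and extropy (MSI-X). The token streams and the exact construction are assumptions for illustration, not the paper's definition; the extropy formula itself is the standard complementary dual of entropy:

```python
import math
from collections import Counter

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def extropy(p):
    """Extropy: the complementary dual of Shannon entropy."""
    return -sum((1 - pi) * math.log2(1 - pi) for pi in p if pi < 1)

# Hypothetical token streams from a requirement and a source file.
source_tokens = "user shall enter a valid password to log in".split()
target_tokens = "def log_in ( user , password ) : validate password".split()

src, tgt = Counter(source_tokens), Counter(target_tokens)

# Minimum shared tokens: for each common token, the smaller of its two counts.
shared = {t: min(src[t], tgt[t]) for t in src.keys() & tgt.keys()}
total = sum(shared.values())
p_shared = [c / total for c in shared.values()]

msi_entropy = entropy(p_shared)  # MSI-I: minimum shared information as entropy
msi_extropy = extropy(p_shared)  # MSI-X: minimum shared information as extropy
```

With only two shared tokens ("user" and "password", once each) the distribution is uniform, so entropy and extropy coincide; they diverge on larger, skewed overlaps.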
The repository includes the following analyses and plots:
- Manifold Analysis for Entropy Measures
- Scatter Matrix for Minimum Shared Entropy/Extropy
- Manifold Entropy Measures Distributions
- Shared Information Distributions
- Manifold Entropy by Ground Truth
- Shared Information by Ground Truth
- Word2Vec Precision-Recall-ROC
- Doc2Vec Precision-Recall-ROC
- Word2Vec Precision-Recall-Gain [WMD]
- Corr WMD vs MSI-I
- Corr WMD vs MSI-X
- Corr WMD vs MI
- Mutual Information - WMD Group by Loss
- Mutual Information - WMD Group by Noise
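The "Corr WMD vs MI/MSI" analyses above amount to correlating a per-pair distance score (such as Word Mover's Distance) with a per-pair information measure. A minimal sketch with a hand-rolled Pearson correlation and made-up scores (the arrays are illustrative, not results from the study):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-pair scores: WMD distances and mutual information values
# for five (requirement, source file) pairs.
wmd = [0.9, 0.7, 0.6, 0.4, 0.2]
mi = [0.1, 0.3, 0.3, 0.6, 0.8]

r = pearson(wmd, mi)  # expected negative: closer pairs share more information
```

A strong negative correlation would indicate that the unsupervised distance and the information measure agree on which pairs are related.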