On the Use of Information Theory to Bound the Effectiveness of Unsupervised Software Retrieval
In theory, software requirements can be transformed into other forms of information such as source code, test cases, or design artifacts. We refer to these requirements, or any initial/raw form of information, as the source artifacts. Conversely, information produced by a transformation or alteration is a target artifact. In a software engineering context, a transformation is any action or intervention a software engineer applies to the requirements. The "transmission of information" can therefore be understood as the programming task itself, or any other generative process that produces the target artifacts.
Information can be transmitted by copying or by transformation. In a software system, the information in the requirements X is transmitted through a transformation f(X): any generative process that consumes the source information and produces an information outcome Y. The outcomes are software artifacts produced by the generative process, such as source code, test cases, or design diagrams.
Five information measures complete the information manifold:
- Self-Information of source artifacts I(X). The entropy of a set of source artifacts, such as requirements.
- Self-Information of target artifacts I(Y). The entropy of a set of target artifacts, such as source code.
- Mutual Information I(X;Y). The amount of information X reveals about Y and, symmetrically, Y about X.
- Loss I(X|Y). The amount of information that enters the "channel" but never comes out (i.e., is never depicted in the target).
- Noise I(Y|X). The amount of information that comes out (i.e., is found in the target) but never came in (i.e., was never depicted in the source).
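The five manifold measures above are related by the standard entropy identities H(X|Y) = H(X,Y) - H(Y) and I(X;Y) = H(X) + H(Y) - H(X,Y). A minimal sketch of a token-level estimate follows; the zipping of source and target token streams into paired (X, Y) observations is an illustrative assumption for the joint entropy, not the paper's estimator:

```python
from collections import Counter
from math import log2

def entropy(tokens):
    """Shannon entropy (bits) of an empirical token distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def manifold_measures(source_tokens, target_tokens):
    """Estimate the five manifold measures from token observations.

    Illustrative assumption: the two token streams are treated as paired
    samples of (X, Y), so H(X, Y) is estimated from zipped pairs. Real
    artifact corpora would require a proper joint model.
    """
    h_x = entropy(source_tokens)
    h_y = entropy(target_tokens)
    h_xy = entropy(list(zip(source_tokens, target_tokens)))
    return {
        "I(X)": h_x,                 # self-information of source
        "I(Y)": h_y,                 # self-information of target
        "I(X;Y)": h_x + h_y - h_xy,  # mutual information
        "I(X|Y)": h_xy - h_y,        # loss
        "I(Y|X)": h_xy - h_x,        # noise
    }

# Toy example: requirement tokens vs. generated code tokens.
src = ["user", "login", "password", "login"]
tgt = ["auth", "login", "hash", "login"]
print(manifold_measures(src, tgt))
```

Note that the decomposition I(X) = I(X;Y) + I(X|Y) holds by construction, which gives a quick sanity check on any estimator.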
Additionally, we introduce the concept of "Minimum Shared Information" for entropy and extropy.
- Minimum Shared Information for Entropy. The minimum number of tokens shared between the source and target sets, represented as entropy.
- Minimum Shared Information for Extropy. The minimum number of tokens shared between the source and target sets, represented as extropy.
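One plausible reading of these two measures, sketched below under stated assumptions: take the multiset intersection of source and target tokens as the "minimum shared" tokens, then summarize that shared distribution as entropy (MSI-I) or as extropy (MSI-X), where extropy is the complementary dual J(p) = -Σ (1 - p_i) log2(1 - p_i). The exact MSI definition is the paper's; this is an illustration, not its implementation:

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a count distribution."""
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

def extropy(counts):
    """Extropy J(p) = -sum (1 - p_i) * log2(1 - p_i), entropy's dual."""
    n = sum(counts.values())
    # Skip p_i == 1 to avoid log2(0); that term contributes 0 in the limit.
    return -sum((1 - c / n) * log2(1 - c / n)
                for c in counts.values() if c < n)

def min_shared_information(source_tokens, target_tokens):
    """Illustrative sketch: the minimum shared tokens (multiset
    intersection of source and target) represented as entropy and extropy.
    """
    shared = Counter(source_tokens) & Counter(target_tokens)  # multiset min
    if not shared:
        return {"MSI-I": 0.0, "MSI-X": 0.0}
    return {"MSI-I": entropy(shared), "MSI-X": extropy(shared)}
```

With no overlap between the token sets, both measures degenerate to zero, which matches the intuition that nothing was demonstrably transmitted.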
Manifold Analysis for Entropy Measures
Scatter Matrix for Minimum Shared Entropy/Extropy
Manifold Entropy Measures Distributions
Shared Information Distributions
Manifold Entropy by Ground Truth
Shared Information by Ground Truth
Word2Vec Precision-Recall-ROC
Doc2Vec Precision-Recall-ROC
Word2Vec Precision-Recall-Gain [WMD]
Corr WMD vs MSI-I
Corr WMD vs MSI-X
Corr WMD vs MI
Mutual Information - WMD Group by Loss
Mutual Information - WMD Group by Noise