Introduction
To facilitate the sharing of software applications across diverse disciplines, it is often desirable to develop a mechanism that automatically collects software distributed on the Web and makes it easily accessible to users. Currently, the most widely used search method is keyword matching. However, precisely understanding the intention of users from their search keywords remains a major challenge. Fortunately, recent studies show that semantic information can help extend our understanding of the original content. For example, by using an ontology in search, especially for organizing resources and interpreting search words, search engines can better understand the meaning of various concepts and produce more targeted outputs.
Previous studies show that a good correlation between computed relatedness scores and human judgments can be achieved by using vectors of Wikipedia concepts.
In this paper, we apply similar ideas to software search by using Wikipedia-based concept vectors to describe software applications. A semantic index is also built on this concept space. By fusing information from multiple sources in this way, a more complete understanding of a specific software application can be achieved than from its original description alone. During the search process, the relatedness between software applications (represented as concept vectors) is calculated to rank the search results.
The contribution of this paper is twofold. First, we present extended concept space construction (ECSC), a novel approach to forming an extended semantic representation of an original concept based on Wikipedia. This method can also be applied to other tasks that require semantic information, without building a comprehensive semantic system. Second, we apply ECSC to the task of software search and achieve promising results compared to traditional software search engines in terms of consistency with the search results produced by human users.
The rest of this paper is organized as follows. The strategy for extending the concept space of software for indexing and semantic search is detailed in the next section, and the paper concludes with some directions for future work.
Semantic Search Using ECSC
It is well known that useful semantic information can be extracted by analyzing the log files of search engines to provide better search services. However, in many cases, these log files are not accessible to the public. A more practical approach is to use publicly available knowledge resources such as Wikipedia to acquire the semantic information needed to improve the search experience. More specifically, given a software description, a concept space (a vector of concepts) is constructed to represent its essential attributes (e.g., what each term is about and what it is).
For example, the description of Weka (Waikato Environment for Knowledge Analysis) may read: "A suite of machine learning software developed by the Machine Learning Group at University of Waikato, containing a collection of popular machine learning algorithms for data mining tasks." Ideally, a user who is interested in a certain data mining algorithm should be able to retrieve this software record using the name of the algorithm, or even one of its aliases, as the keyword, even if that name does not explicitly appear in the original description. In other words, the software search engine should be able to achieve a more general understanding of the software on top of its existing description.
In Wikipedia, each article is treated as a concept and each concept is represented by a vector of words that occur in the article. The strength of association between words and concepts can be computed by WLVM (Wikipedia Link Vector Model). WikipediaMiner contains an implementation of WLVM, which was used in our work to compute the relatedness between concepts.
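The idea of representing each concept as a weighted vector of words can be illustrated with a small sketch. This is not WLVM and not the WikipediaMiner API: it substitutes plain TF-IDF weighting over a few invented article snippets, and the names `ARTICLES`, `build_concept_vectors`, and `association` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy "articles"; a real system would use Wikipedia article text.
ARTICLES = {
    "Support vector machine": "classifier kernel margin supervised learning classifier",
    "Naive Bayes classifier": "classifier probability bayes supervised learning",
    "Decision theory": "probability utility choice bayes",
}


def build_concept_vectors(articles):
    """Map each concept (article title) to a TF-IDF weighted word vector."""
    tokenized = {title: text.lower().split() for title, text in articles.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many articles does each word occur?
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))

    vectors = {}
    for title, words in tokenized.items():
        counts = Counter(words)
        vectors[title] = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return vectors


def association(vectors, word, concept):
    """Strength of association between a word and a concept (0.0 if absent)."""
    return vectors[concept].get(word, 0.0)


vectors = build_concept_vectors(ARTICLES)
```

In this toy setting, a word such as "kernel" is associated only with the concept whose article mentions it, while words shared by several articles receive lower weights.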
Using ECSC, each software application is annotated with a set of concepts (the original concept space), and each concept is mapped to a weighted sequence of Wikipedia concepts ordered by their relevance using WikipediaMiner (i.e., Wikipedia concepts are used to augment the bag of concepts that represents the software). Concepts with low levels of relatedness are discarded. Finally, all sets of concepts are merged into a single vector of concepts (the extended concept space), which serves as the new software description. When searching for software, the semantic relatedness between two software applications is calculated by comparing their vectors of concepts, for example, using the cosine metric. Moreover, with the help of ECSC, the same input keyword can now retrieve closely related software records that would otherwise not have been retrieved. The overall process of ECSC is shown in Fig. 1.
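For reference, the cosine metric mentioned above can be written as follows. The notation is an assumption introduced here for illustration: $a_i$ and $b_i$ denote the weights of the $i$-th Wikipedia concept in the extended concept spaces of two software applications $A$ and $B$, and $n$ is the number of distinct concepts.

```latex
\mathrm{rel}(A, B) = \cos(\mathbf{a}, \mathbf{b})
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \, \sqrt{\sum_{i=1}^{n} b_i^{2}}}
```

With non-negative weights the score lies between 0 (no shared concepts) and 1 (identical direction), so two applications can be related even when their original descriptions share no keywords, as long as their extended concept spaces overlap.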
Fig. 1
A diagram of the overall process of ECSC
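The steps of ECSC described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the relatedness scores in `RELATED` are invented stand-ins for values that would come from WikipediaMiner, the threshold of 0.3 is arbitrary, original concepts are assumed to carry weight 1.0, and duplicate concepts are assumed to be merged by keeping the highest score. All function names are hypothetical.

```python
import math

# Hypothetical relatedness scores (invented for illustration only).
RELATED = {
    "Machine learning": {"Supervised learning": 0.8,
                         "Statistical classification": 0.7,
                         "Computer science": 0.2},
    "Data mining": {"Statistical classification": 0.6,
                    "Decision tree learning": 0.5},
    "Support vector machine": {"Statistical classification": 0.9,
                               "Supervised learning": 0.7,
                               "Kernel methods": 0.6},
}


def extend_concept_space(original_concepts, related, threshold=0.3):
    """Build the extended concept space as a {concept: weight} mapping."""
    extended = {c: 1.0 for c in original_concepts}  # assumption: weight 1.0
    for concept in original_concepts:
        for neighbour, score in related.get(concept, {}).items():
            if score < threshold:
                continue  # discard concepts with low relatedness
            # assumption: merge duplicates by keeping the highest score
            extended[neighbour] = max(extended.get(neighbour, 0.0), score)
    return extended


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_concepts, catalogue, related, threshold=0.3):
    """Rank software records by relatedness to the query, best first."""
    query = extend_concept_space(query_concepts, related, threshold)
    scored = [
        (name, cosine(query, extend_concept_space(concepts, related, threshold)))
        for name, concepts in catalogue.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical catalogue of annotated software records.
catalogue = {
    "Weka": ["Machine learning", "Data mining"],
    "Image editor": ["Raster graphics"],
}
ranking = rank(["Support vector machine"], catalogue, RELATED)
```

In this toy example the query concept does not appear among the original concepts of the Weka record, yet the record still ranks first because the two extended concept spaces share related concepts.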
To demonstrate how our approach works, the top 10 individual Wikipedia concepts most relevant to a given concept (e.g., SVM and Bayesian probability) are shown in Table 1. [...] a Python open source machine learning library.
Table 1
The top 10 concepts most relevant to SVM and Bayesian probability
| # | SVM | Bayesian probability |
|---|---|---|
| 1 | Statistical classification | Bayes theorem |
| 2 | Decision tree learning | Frequency probability |
| 3 | Naive Bayes classifier | Prior probability |
| 4 | Supervised learning | Decision theory |
| 5 | Perceptron | Principle of indifference |
| 6 | Linear classifier | Probability interpretations |
| 7 | Kernel methods | Statistical inference |
| 8 | Multiclass classification | |