Wednesday, August 21, 2019
Higher Quality Input Phrase To Driven Reverse Dictionary
Implementing a Higher Quality Input Phrase To Driven Reverse Dictionary

E. Kamalanathan and C. Sunitha Ram

ABSTRACT

This work implements a higher quality, input-phrase-driven reverse dictionary. In contrast to a conventional forward dictionary, which maps from words to their definitions, a reverse dictionary takes a user input phrase describing the desired concept and returns a set of candidate words that satisfy the input phrase. This work has significant application not only for the general public, particularly those who work closely with words, but also in the general field of conceptual search. We present a set of algorithms and the results of a set of experiments showing the retrieval accuracy and the runtime latency performance of the implementation. The experimental results show that the approach can provide significant improvements in performance scale without sacrificing the quality of the result. Experiments comparing the quality of the approach to that of currently available reverse dictionaries show that the approach can provide significantly higher quality than either of the other currently available implementations.

Index Terms: Dictionaries, thesauruses, search process, web-based services.

INTRODUCTION

This report describes work on creating a reverse dictionary. As opposed to a regular (forward) dictionary, which maps words to their definitions, a reverse dictionary performs the converse mapping, i.e., given a phrase describing the desired concept, it provides words whose definitions match the entered definition phrase. It is relevant to language understanding. The approach has some of the characteristics expected from a robust language understanding system. Firstly, learning relies only on unannotated text data, which is abundant and does not contain the individual bias of an annotator.
Secondly, the approach is based on general-purpose resources (Brill's PoS tagger, WordNet [7]), and the performance is studied under pessimistic (hence more realistic) assumptions, e.g., that the tagger is trained on a standard dataset with potentially different properties from the documents to be clustered. Similarly, the approach studies the potential benefits of using all possible senses (and hypernyms) from WordNet, in an attempt to defer (or avoid altogether) the need for Word Sense Disambiguation (WSD), and the related pitfalls of a WSD tool that may be biased towards a particular domain or language style.

BACKGROUND WORK

Natural Language Processing: Natural Language Processing (NLP) [6] is a large field which encompasses many categories that are related to this work. Specifically, NLP is the process of computationally extracting meaningful information from natural languages. In other words, it is the ability of a computer to interpret the expressive power of natural language. Subcategories of NLP which are relevant for this work are presented below.

WordNet: WordNet [7], [2] is a large lexical database containing the words of the English language. It resembles a thesaurus in that it groups together words that have similar meanings. WordNet is something more, since it also specifies different connections for each of the senses of a given word. These connections place words that are semantically related close to one another in a network. WordNet also has some qualities of a dictionary, since it gives the definitions of words and their corresponding parts of speech. The synonym relation is the main connection between words, which means that words which are conceptually equivalent, and thus interchangeable in most contexts, are grouped together. These groupings are called synsets and consist of a definition and relations to other synsets.
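As a rough illustration of the synset idea, the following Python sketch models a synset as a definition, the words that share it, and links to more general synsets. The synset names, glosses and the `synsets_of` helper are made up for this example and are not taken from the WordNet database or from the authors' Java code.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One sense: a definition, the words sharing it, and links to other synsets."""
    name: str
    definition: str
    lemmas: list
    hypernyms: list = field(default_factory=list)  # names of more general synsets


# Hypothetical toy entries; the real database is far larger and worded differently.
SYNSETS = {
    "bank.n.01": Synset("bank.n.01", "sloping land beside water",
                        ["bank"], ["slope.n.01"]),
    "bank.n.02": Synset("bank.n.02", "an institution that accepts deposits",
                        ["bank", "depository"], ["institution.n.01"]),
    "slope.n.01": Synset("slope.n.01", "an inclined surface", ["slope", "incline"]),
    "institution.n.01": Synset("institution.n.01", "an established organization",
                               ["institution"]),
}


def synsets_of(word):
    """Return the names of every synset that lists `word` among its lemmas."""
    return sorted(s.name for s in SYNSETS.values() if word in s.lemmas)


print(synsets_of("bank"))        # ['bank.n.01', 'bank.n.02']
print(synsets_of("depository"))  # ['bank.n.02']
```

Looking up "bank" returns two synsets, one per meaning, while "depository" belongs to only one.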
A word can be part of more than one synset, since it can bear more than one meaning. WordNet has a total of 117 000 synsets, which are linked together. Not all synsets have a distinct path to another synset. This is because the data structure in WordNet is split into four different groups: nouns, verbs, adjectives and adverbs (since they follow different rules of grammar). Thus it is not possible to compare words in different groups unless all groups are linked together with a common entity. There are some exceptions which link synsets across parts of speech in WordNet, but these are rare. It is not always possible to find a relation between two words within a group, since each group is made of different base types. The relations that connect the synsets within the different groups vary based on the type of the synsets.

Application Programming Interface

Several Application Programming Interfaces (APIs) exist for WordNet. These allow easy access to the platform and often provide additional functionality. One example is the Java WordNet Library [8] (JWNL), which allows access to the WordNet library files.

PoS Tagging

PoS tags [8] are assigned to the corpus using Brill's PoS tagger. As PoS tagging requires the words to be in their original order, this is done before any other modifications to the corpora. Part-of-speech (PoS) tagging is the field concerned with analysing a text and assigning different grammatical roles to each entity. These roles are based on the definition of the particular word and the context in which it is written. Words that are in close proximity to each other often affect and assign meaning to each other. The PoS tagger's job is to assign grammatical roles such as noun, verb, adjective, adverb, etc. based upon these relations. PoS tagging is important in information retrieval and in text processing in general.
This is the case since natural languages contain a lot of ambiguity, which can make distinguishing words and terms difficult. There are two main schools of PoS tagging: rule-based and stochastic. Examples of the two are Brill's tagger and the Stanford PoS tagger, respectively. Rule-based taggers work by first applying the most frequently used PoS for a given word. Predefined lexical rules are then applied to the structure for error analysis. Errors are corrected until a satisfactory threshold is reached. Stochastic taggers use a trained corpus to determine the PoS of a given word.

Stopword Removal

Stopwords, i.e. words thought not to convey any meaning, are removed from the text. The approach taken in this work does not compile a static list of stopwords, as is usually done. Instead, PoS information is exploited and all tokens that are not nouns, verbs or adjectives are removed. Stop words are words which occur often in text and speech. They do not tell much about the content they are wrapped in, but help humans understand and interpret the rest of the content. These terms are so generic that they do not mean anything by themselves. In the context of text processing they are basically just empty words, which only take up space, increase computational time and affect the similarity measure in a way which is not relevant. This can result in false positives.

Table 1: List of stop words

The stop-word removal class includes only one method, which runs through a list of words and removes all occurrences of words specified in a file. A text file which specifies the stop words is loaded into the program. This file is called “stop-words.txt” and is located in the home directory of the program. The text file can be edited so that it contains only the desired stop words. A representation of the stop words used in the text file can be found in Table 1. After the list of stop words has been loaded, it is compared to the words in the given list.
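The file-based stop-word filter described here could look roughly like the following Python sketch. It is not the authors' Java class: the function names are invented for illustration, and a short in-memory list stands in for the contents of stop-words.txt, whose actual entries are not reproduced in this post.

```python
def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stop_words(words, stop_words):
    """Return a new list without any word found in stop_words."""
    return [w for w in words if w.lower() not in stop_words]


# Assumed stand-in for the lines of stop-words.txt; an open file object
# could be passed to load_stop_words in the same way.
stop_file_lines = ["a", "the", "in", "of", "and", "to", "very", "who"]

stop_words = load_stop_words(stop_file_lines)
tokens = "the person who talks a lot".split()
filtered = remove_stop_words(tokens, stop_words)
print(filtered)  # ['person', 'talks', 'lot']
```

Each word of the input list is checked against the loaded set, and the original list is left unmodified.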
If a match is found, the given word is removed from the list. A list stripped of stop words is then returned.

Stemming

Words with the same meaning appear in various morphological forms. To capture their similarity they are normalised into a common root form, the stem. The morphology function provided with WordNet is used for stemming, because it only yields stems that are contained in the WordNet dictionary. The stemming class contains five methods: one for converting a list of words into a string, two for stemming a list of words and two for handling access to WordNet through the JWNL API [8]. The first method, listToString(), takes an ArrayList of strings and concatenates these into a string representation. The second method, stringStemmer(), takes an ArrayList of strings and iterates through each word, stemming it by calling the private method wordStemmer(). This method checks that the JWNL API has been loaded and starts stemming by looking up the lemma of a word in WordNet. Before this is done, each word starting with an uppercase letter is checked to see if it can be used as a noun. If the word can be used as a noun, it does not qualify for stemming and is returned in its original form. The lemma lookup is done using a morphological processor, which is provided by WordNet. This morphs the word into its lemma, after which the word is checked for a match in the WordNet database. This is done by running through all the PoS databases defined in WordNet. If a match is found, the lemma of the word is returned; otherwise the original word is returned. Lastly, the two methods handling access to WordNet initialize the JWNL API and shut it down, respectively. The initializer() method gets an instance of the dictionary files and loads the morphological processor. If this method is not called, the program is not able to access the WordNet files. The method close() closes the dictionary files and shuts down the JWNL API.
This method is not used in the program, since it would not make sense to unload the dictionary once it has been loaded. It would only increase the total execution time. It has been implemented for good measure, should it be needed. Stemming [5] is the process of reducing an inflected or derived word to its base form. In other words, all morphological variants of a word are reduced to the same form, which makes comparison easier. The stemmed word is not necessarily reduced to its morphological root, but to a mutual stem. The morphological variants of a word have different suffixes, but in essence describe the same thing. These different variants can therefore be merged into a single representative form. Thus a comparison of stemmed words yields a higher relation for equivalent words. In addition, storage becomes more efficient. Words like observes, observed, observation and observationally should all be reduced to a mutual stem such as observe.

PROPOSED SYSTEM

The reverse dictionary approach can provide significantly higher quality. We propose a set of methods for building and querying a reverse dictionary. The reverse dictionary system is based on the notion that a phrase that conceptually describes a word should resemble the word's actual definition, if not matching the exact words, then at least being conceptually similar. Consider, for example, the following concept phrase: “talks a lot, but without much substance.” Based on such a phrase, a reverse dictionary should return words such as “gabby,” “chatty,” and “garrulous.”

Forward mapping (standard dictionary): Intuitively, a forward mapping designates all the senses for a particular word phrase. This is expressed in terms of a forward map set (FMS). The FMS of a (word) phrase W, designated by F(W), is the set of (sense) phrases {S1, S2, ..., Sn} such that for each Sj ∈ F(W), (W → Sj) ∈ D, where D is the dictionary.
For example, suppose that the term “jovial” is associated with various meanings, including “showing high-spirited merriment” and “pertaining to the god Jove, or Jupiter.” Here, F(jovial) would contain both of these phrases.

Reverse mapping (reverse dictionary): Reverse mapping applies to terms and is expressed as a reverse map set (RMS). The RMS of t, denoted R(t), is a set of phrases {P1, P2, ..., Pi, ..., Pm}, such that ∀Pi ∈ R(t), t ∈ F(Pi). Intuitively, the reverse map set of a term t consists of all the (word) phrases in whose definition t appears. The find candidate words phase consists of two key substeps: 1) Build the RMS. 2) Query the RMS.

A. COMPONENTS

The first preprocessing step is to PoS tag the corpus. The PoS tagger relies on the text structure and morphological differences to determine the appropriate part of speech. For this reason, if it is required, PoS tagging is the first step to be carried out. After this, stopword removal is performed, followed by stemming. This order is chosen to reduce the number of words to be stemmed. The stemmed words are then looked up in WordNet and their corresponding synonyms and hypernyms are added to the bag-of-words. Once the document vectors are completed in this way, the frequency of each word across the corpus can be counted and every word occurring less often than the prespecified threshold is pruned. Stemming, stopword removal and pruning all aim to improve clustering quality by removing noise, i.e. meaningless data. They all lead to a reduction in the number of dimensions in the term space. Weighting is concerned with the estimation of the importance of individual terms. All of these have been used extensively and are considered the baseline for comparison in this work. However, the two techniques under investigation both add data to the representation: PoS tagging adds syntactic information, and WordNet is used to add synonyms and hypernyms.

B. BUILDING REVERSE MAPPING SETS

The input phrase is split into words, any stop words that appear (a, be, person, some, someone, too, very, who, the, in, of, and, to) are removed, and other words having the same meaning are found from the forward dictionary data sources. Given the large size of dictionaries, creating such mappings on the fly is infeasible. Thus, we precreate these Rs for every relevant term in the dictionary.
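The construction of reverse map sets from a forward dictionary can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the tiny `FORWARD` dictionary and the `terms` and `build_rms` helpers are hypothetical, the entry for "merry" is invented for the example, and stemming is left out for brevity.

```python
from collections import defaultdict

# Level 1 stop words as listed in the text.
STOP_WORDS = {"a", "be", "person", "some", "someone", "too", "very",
              "who", "the", "in", "of", "and", "to"}

# Hypothetical forward dictionary D: each word W maps to its sense phrases F(W).
FORWARD = {
    "jovial": ["showing high-spirited merriment",
               "pertaining to the god Jove, or Jupiter"],
    "merry": ["full of high-spirited merriment"],  # invented entry
}


def terms(phrase):
    """Split a sense phrase into lower-cased terms, dropping stop words."""
    cleaned = (w.strip(".,;:").lower() for w in phrase.split())
    return [w for w in cleaned if w and w not in STOP_WORDS]


def build_rms(forward):
    """Map every term t to R(t): the words whose definitions contain t."""
    rms = defaultdict(set)
    for word, senses in forward.items():
        for sense in senses:
            for t in terms(sense):
                rms[t].add(word)
    return dict(rms)


RMS = build_rms(FORWARD)
print(sorted(RMS["merriment"]))  # ['jovial', 'merry']
print(sorted(RMS["jove"]))       # ['jovial']
```

In this sketch build_rms walks the whole forward dictionary before any query is answered.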
This is a one-time, offline event; once these mappings exist, we can use them for ongoing lookup. Thus, the cost of creating the corpus has no effect on runtime performance. For an input dictionary D, we create R mappings for all terms appearing in the sense phrases (definitions) in D.

C. RMS QUERY

This module responds to user input phrases. Upon receiving such an input phrase, we query the R indexes already present in the database to find candidate words whose definitions have any similarity to the input phrase. Upon receiving an input phrase U, we process U using a stepwise refinement approach. We start off by extracting the core terms from U and searching for the candidate words (Ws) whose definitions contain these core terms exactly. (Note that we tune these terms slightly to increase the probability of generating Ws.) If this first step does not generate a sufficient number of output Ws, processing continues with the next refinement step. Sufficiency is defined by a tuneable input parameter α, which represents the minimum number of word phrases needed to halt processing and return output.

D. CANDIDATE WORD RANKING

This module sorts a set of output Ws in order of decreasing similarity to U, based on semantic similarity. To build such a ranking, we need to be able to assign a similarity measure to each (S, U) pair, where U is the user input phrase and S is a definition for some W in the candidate word set O. The measure used is Wu and Palmer's conceptual similarity (WUP). The WUP similarity between concepts a and b in a hierarchy is commonly given as sim(a, b) = 2 · depth(lso(a, b)) / (len(a, b) + 2 · depth(lso(a, b))). Here depth(lso(a, b)) is the global depth of the lowest superordinate of a and b, and len(a, b) is the length of the path between the nodes a and b in the hierarchy.

SOLUTION ARCHITECTURE

We now describe our implementation architecture, with particular attention to design for scalability. The Reverse Dictionary Application (RDA) is a software module that takes a user phrase (U) as input and returns a set of conceptually related words as output.

Figure 1. Architecture of reverse dictionary.
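To make the ranking measure from section D concrete, the following Python sketch computes Wu and Palmer similarity on a small hand-made hierarchy, assuming the usual form 2 · depth(lso(a, b)) / (len(a, b) + 2 · depth(lso(a, b))) with the root at depth 1. The `PARENT` hierarchy and all function names are illustrative assumptions. The sketch scores single concepts only; how term-level scores are combined into a score for an (S, U) phrase pair is not shown.

```python
# Hypothetical toy hierarchy: each concept maps to its parent (None for the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "plant": "entity",
    "tree": "plant",
}


def path_to_root(node):
    """Return the chain [node, parent, ..., root]."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def lso(a, b):
    """Lowest superordinate: the deepest concept that is an ancestor of both."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward from b, so the first hit is lowest
        if node in ancestors_a:
            return node


def path_len(a, b):
    """Number of edges on the path between a and b through their lso."""
    common = depth(lso(a, b))
    return (depth(a) - common) + (depth(b) - common)


def wup(a, b):
    """Wu and Palmer similarity in the form assumed above."""
    d = depth(lso(a, b))
    return 2 * d / (path_len(a, b) + 2 * d)


def rank(query, candidates):
    """Sort candidates by decreasing similarity to the query concept."""
    return sorted(candidates, key=lambda c: wup(query, c), reverse=True)


print(rank("dog", ["tree", "cat", "dog"]))  # ['dog', 'cat', 'tree']
```

With this hierarchy, "dog" and "cat" share the superordinate "animal" and score 2/3, while "dog" and "tree" meet only at the root and score 1/3.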
The user input phrase is split into words, and stemming is performed. Every relevant term in the forward dictionary data source is identified. The generate-query step takes the input phrase and the minimum and maximum output thresholds as input, removes level 1 stop words (a, be, person, some, someone, too, very, who, the, in, of, and, to), performs stemming, and generates the query. Executing the query finds the set of candidate words. Finally, the results are sorted based on semantic similarity.

EXPERIMENTAL ENVIRONMENT

Our experimental environment consisted of two servers, each with a 2.2 GHz dual-core CPU and 2 GB RAM, running Windows XP Pro and above. On one server, we installed our implementation of the algorithms (written in Java). The other server housed the WordNet dictionary data.

CONCLUSION

We describe the many challenges inherent in building a reverse dictionary, and map the problem to the well-known conceptual similarity problem. We propose a set of methods for building and querying a reverse dictionary, and describe a set of experiments that show the quality of our results, as well as the runtime performance under load. Our experimental results show that our approach can provide significant improvements in performance scale without sacrificing solution quality. Unlike a traditional forward dictionary, which maps from words to their definitions, the higher quality input-phrase-driven reverse dictionary takes a user input phrase describing the desired concept, and it reduces to the well-known conceptual similarity problem. The set of methods for building a reverse mapping and querying a reverse dictionary produces higher quality results, but for larger queries the approach is fairly slow.

REFERENCES

T. Dao and T. Simpson, “Measuring Similarity between Sentences,” 2009. http://opensvn.csie.org/WordNetDotNet/trunk/ Projects/

T. Hofmann, “Probabilistic Latent Semantic Indexing,” SIGIR '99: Proc. 22nd Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 50-57, 1999.

D. Lin, “An Information-Theoretic Definition of Similarity,” Proc. Int'l Conf. Machine Learning, 1998.

M. Porter, “The Porter Stemming Algorithm,” http://tartarus.org/martin/PorterStemmer/, 2009.

G. Miller, C. Fellbaum, R. Tengi, P. Wakefield, and H. Langone, “Wordnet Lexical Database,” http://wordnet.princeton.edu/wordnet/download/, 2009.

P. Resnik, “Semantic Similarity in a Taxonomy: An Information-Based Measure and Its Application to Problems of Ambiguity in Natural Language,” J. Artificial Intelligence Research, vol. 11, pp. 95-130, 1999.

AUTHORS PROFILE

E Kamalanathan is pursuing his Master of Engineering (part time) from Department of Computer Science and Engineering, SCSVMV University Enathur,