Tutoring Case: Q1

The details of Q1 implementation

1. __init__ function

Four variables, tf_tokens, tf_entities, idf_tokens and idf_entities, are each initialized to None so that the later functions can fill them in and use them. The results produced by those functions are what verify these four indexes.

2. index_documents function

First, traverse the documents, take out the text of each one as a string, and process it with spaCy to extract its entities and its tokens. A dict stores, for each entity, the corresponding doc_ID and frequencies, and tokens are stored the same way. Entities are indexed first, because they need no further filtering. Tokens need extra restrictions: tokens flagged by is_stop or is_punct are filtered out, single-word cases are checked, and tokens that already appear in an entity are not counted. The token TF and entity TF are computed first, and the corresponding IDF values are then derived from them.

3. split_query function

First, define a queries array to hold the queries that correspond to the different splits. Then:

First, select the eligible entities in DoE.
Second, enumerate all combinations of those entities.
Third, compare the word frequencies of each combination with those of the query, discard the combinations that do not fit, and keep the eligible entity combinations.
Finally, for each eligible entity combination, derive the corresponding tokens and build the query.

4. max_score_query function

For each query obtained in the previous step, the token set and the entity set are scored separately. For the token set, the TF-IDF of each token is calculated from the corresponding TF and IDF values, and the sum is saved as s2. Similarly, the TF-IDF of each entity is calculated, and the sum is saved as s1. Finally, s1 and s2 are each multiplied by their weight and added to give s. The query with the largest s is selected, and that s and query are saved as the result.
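The four steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the original solution: documents arrive with tokens and entities already extracted (the spaCy step and the is_stop / is_punct filtering are left out), the TF and IDF formulas and the weights w_entity and w_token are placeholders chosen for the example, and the helper names build_index and tf_idf are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def build_index(docs):
    """Build TF and IDF tables from pre-extracted tokens and entities.

    Assumption: docs maps doc_id -> {"tokens": [...], "entities": [...]}.
    """
    tf_tokens, tf_entities = {}, {}
    for doc_id, doc in docs.items():
        for term, n in Counter(doc["tokens"]).items():
            tf_tokens.setdefault(term, {})[doc_id] = n
        for term, n in Counter(doc["entities"]).items():
            tf_entities.setdefault(term, {})[doc_id] = n

    n_docs = len(docs)

    def idf(tf_table):
        # Assumed smoothing: 1 + ln(N / (1 + document frequency)).
        return {t: 1.0 + math.log(n_docs / (1.0 + len(postings)))
                for t, postings in tf_table.items()}

    return {"tf_tokens": tf_tokens, "idf_tokens": idf(tf_tokens),
            "tf_entities": tf_entities, "idf_entities": idf(tf_entities)}


def split_query(query, candidate_entities):
    """Enumerate every split of the query into free tokens plus entities."""
    query_counts = Counter(query.split())
    splits = []
    for r in range(len(candidate_entities) + 1):
        for combo in combinations(candidate_entities, r):
            needed = Counter(w for ent in combo for w in ent.split())
            # Skip combinations that use a word more often than the query has it.
            if any(needed[w] > query_counts[w] for w in needed):
                continue
            leftover = query_counts - needed  # words not consumed by entities
            splits.append({"tokens": sorted(leftover.elements()),
                           "entities": list(combo)})
    return splits


def tf_idf(term, doc_id, tf_table, idf_table):
    """Assumed weighting: (1 + ln tf) * idf, or 0 if the term is absent."""
    tf = tf_table.get(term, {}).get(doc_id, 0)
    if tf == 0:
        return 0.0
    return (1.0 + math.log(tf)) * idf_table.get(term, 0.0)


def max_score_query(splits, doc_id, index, w_entity=1.0, w_token=0.4):
    """Return (s, split) for the split with the largest s = w1*s1 + w2*s2."""
    best = None
    for split in splits:
        s1 = sum(tf_idf(e, doc_id, index["tf_entities"], index["idf_entities"])
                 for e in split["entities"])
        s2 = sum(tf_idf(t, doc_id, index["tf_tokens"], index["idf_tokens"])
                 for t in split["tokens"])
        s = w_entity * s1 + w_token * s2
        if best is None or s > best[0]:
            best = (s, split)
    return best


if __name__ == "__main__":
    docs = {
        1: {"tokens": ["Times", "report"], "entities": ["New York"]},
        2: {"tokens": ["New", "York", "Times", "Times"], "entities": []},
        3: {"tokens": ["report"], "entities": ["New York Times"]},
    }
    index = build_index(docs)
    splits = split_query("New York Times", ["New York Times", "New York"])
    print(max_score_query(splits, 1, index))
```

Keeping each TF table as a dict of term to {doc_id: frequency} mirrors the description in step 2 and lets the IDF be read off from the number of postings per term. The frequency check in split_query only compares word counts; a fuller version might also require the words of an entity to appear in the query in order.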


Author admin