A Text Feature Selection Method Based on Weighted Latent Dirichlet Allocation and Multi-granularity (cited by: 10)
Abstract: [Objective] To improve the classification of bibliographic records for books and journal articles, this paper proposes a text feature selection method based on a weighted Latent Dirichlet Allocation (wLDA) model and multi-granularity features, taking the structural conventions of bibliographic texts into account. [Methods] Starting from the Pointwise Mutual Information (PMI) model, the method adjusts feature-word weights using factors such as part of speech and position, and carries these weights into the generative process of LDA to extract expressive coarse-grained features. Fine-grained features are obtained with a TF-IDF-based selection strategy. The combined multi-granularity features form the core feature-word set that represents each bibliographic text, and the texts are then classified with algorithms such as KNN and SVM. [Results] In classification experiments on self-built book and journal corpora, the method raises classification accuracy by an average of 3.60% over the LDA method and 4.79% over traditional feature selection methods. [Limitations] The size and diversity of the experimental corpora need to be expanded, and more weighting strategies should be explored to further improve bibliographic text classification. [Conclusions] The experimental results indicate that the method is effective and feasible: it improves how well the selected feature-word set represents the text, and thereby improves classification accuracy.
Authors: Li Xiangdong (李湘东), Ba Zhichao (巴志超), Huang Li (黄莉). Affiliations: 1 School of Information Management, Wuhan University, Wuhan 430072, China; 2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China; 3 Wuhan University Library, Wuhan 430072, China
Source: New Technology of Library and Information Service (《现代图书情报技术》), CSSCI, 2015, No. 5, pp. 42-49 (8 pages)
Keywords: Bibliographic information; Weighted Latent Dirichlet Allocation; Multi-granularity feature; Text classification; Feature selection
About the authors: Corresponding author: Huang Li, ORCID: 0000-0002-3547-3831, E-mail: 709934404@qq.com.
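
The abstract names two scoring steps without giving formulas: a PMI-based term weight adjusted by part of speech and position, and TF-IDF selection of fine-grained features. The sketch below illustrates one plausible reading of those two steps in plain Python. It is not the authors' implementation: the toy corpus, the multiplier values in `POS_FACTOR` and `TITLE_FACTOR`, the way the factors are multiplied into the PMI score, and the function names are all assumptions made for illustration. The step that feeds the weights into the LDA generative model is omitted entirely.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration: (label, title tokens, body tokens).
# Each token is a (word, part-of-speech tag) pair.
DOCS = [
    ("cs", [("topic", "n"), ("model", "n")],
           [("fast", "a"), ("sampling", "n"), ("model", "n")]),
    ("cs", [("text", "n"), ("classification", "n")],
           [("model", "n"), ("feature", "n")]),
    ("lib", [("library", "n"), ("catalog", "n")],
            [("old", "a"), ("book", "n"), ("catalog", "n")]),
    ("lib", [("book", "n"), ("lending", "n")],
            [("library", "n"), ("feature", "n")]),
]

# Hypothetical multipliers; the abstract does not state the actual values.
POS_FACTOR = {"n": 1.0, "a": 0.5}
TITLE_FACTOR = 2.0


def words_of(title, body):
    """Set of distinct words in one record."""
    return {w for w, _ in title + body}


def pmi(word, label, docs):
    """PMI(w, c) = log(P(w, c) / (P(w) * P(c))), estimated from document counts."""
    n = len(docs)
    n_c = sum(1 for lab, _, _ in docs if lab == label)
    n_w = sum(1 for _, t, b in docs if word in words_of(t, b))
    n_wc = sum(1 for lab, t, b in docs
               if lab == label and word in words_of(t, b))
    if n_wc == 0:
        return float("-inf")  # word never co-occurs with this label
    return math.log((n_wc / n) / ((n_w / n) * (n_c / n)))


def term_weight(word, pos, in_title, docs, labels):
    """Largest non-negative PMI over all labels, scaled by POS and position.

    Multiplying the factors is an assumed combination rule.
    """
    base = max(0.0, max(pmi(word, c, docs) for c in labels))
    position = TITLE_FACTOR if in_title else 1.0
    return base * POS_FACTOR.get(pos, 1.0) * position


def tfidf_top_k(doc_words, docs, k):
    """Return the k words of one document with the highest TF-IDF score."""
    n = len(docs)
    tf = Counter(doc_words)
    scores = {}
    for w, f in tf.items():
        df = sum(1 for _, t, b in docs if w in words_of(t, b))
        scores[w] = (f / len(doc_words)) * math.log(n / df)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


if __name__ == "__main__":
    labels = ["cs", "lib"]
    print(term_weight("model", "n", True, DOCS, labels))
    _, title, body = DOCS[2]
    print(tfidf_top_k([w for w, _ in title + body], DOCS, 2))
```

In a fuller pipeline, weights like those from `term_weight` would influence topic-word sampling in LDA to yield the coarse-grained features, and the union of those features with the TF-IDF selections would be passed to a KNN or SVM classifier.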
