Textual Machine Learning for Topic Extraction and Document Similarity Matching

MALLET is an open-source toolkit for statistical natural language processing (NLP), topic modeling, clustering, and other machine learning (ML) applications over text.

Topic modeling is an ML approach for analyzing large volumes of unlabeled text and automatically extracting latent "topics" from it. These topics can then be used to measure document similarity in a deeply contextual way.
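As a rough illustration of how topics can drive similarity matching, the sketch below compares documents by the cosine similarity of their topic-proportion vectors. It is a minimal example under stated assumptions, not the method presented in the session: the document names, the four-topic proportions, and the helper functions `cosine_similarity` and `most_similar` are all hypothetical, and in practice the vectors would come from a trained topic model rather than being typed in by hand.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length topic-proportion vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of topics")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an all-zero vector carries no topic information
    return dot / (norm_p * norm_q)


def most_similar(query_id, doc_topics):
    """Rank every other document by similarity to query_id, highest first."""
    query = doc_topics[query_id]
    scored = [
        (doc_id, cosine_similarity(query, vector))
        for doc_id, vector in doc_topics.items()
        if doc_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-document topic proportions over four topics.
# These values are made up for illustration only.
doc_topics = {
    "doc_a": [0.70, 0.20, 0.05, 0.05],
    "doc_b": [0.65, 0.25, 0.05, 0.05],
    "doc_c": [0.05, 0.05, 0.20, 0.70],
}

ranking = most_similar("doc_a", doc_topics)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Because two documents can share a topic mix without sharing many exact words, comparing topic vectors in this way captures similarity that plain keyword overlap can miss. Cosine similarity is only one possible choice here; a divergence measure between the two distributions would be another.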

In this session, we demonstrate how to use the open-source tool MALLET for textual ML, including topic modeling. We then show how we applied this technique to build an application for fast, deep document similarity matching.

Mr. Mark Wallace is Principal Software Engineer for Semantic Applications at Modus Operandi, where he serves as lead ontologist and semantic mentor for the company. Since 2002, Mr. Wallace has focused on practical applications of semantic technologies. He has taught and presented at the Semantic Technology and Smart Data Conferences for the past seven years. He is a key contributor to all three generations of the Modus Operandi Semantic Wiki. Recently he has led teams in the application of textual machine learning and pattern of life analytics over big data. He has over 25 years of experience in software development in both commercial and military applications, has published papers in the area of Semantics, and has worked with big data since 2009. Mr. Wallace has a B.S. in Computer Science from the University of Central Florida.