Tuesday, January 31, 2017
02:30 PM - 03:15 PM
Level: Technical - Intermediate
This talk describes the use of unsupervised machine learning techniques, especially word embeddings, to significantly improve search quality. The talk will take you through the technology and process used to create a high-quality company dashboard from a large number of unstructured data sources. Building such a dashboard reduces the manual work required of business analysts and improves accuracy and speed.
We will show how a document processing pipeline comprising Ingestion, Enrichment, and Search was built using Machine Learning, NLP, and Information Retrieval technology.
1. Ingestion - The ingestion step extracts textual content from the source document formats. Unstructured data typically arrives as PDF, Word, or HTML documents, so the first step is getting at the text content.
2. Enrichment - The ingested data is then run through an NLP pipeline for sentence extraction, part-of-speech tagging, and entity extraction. Supervised techniques such as SVM+AdaGrad, together with rule-based NLP, are used to annotate the text with targeted entities and phrases. Entities such as key management, operation locations, key financials, products, and services are extracted from the unstructured data sources during this stage.
3. Content selection and classification - Information retrieval techniques are combined with unsupervised phrase and word expansion with word vectors to get the final version of the company dashboard.
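The word expansion in step 3 can be illustrated with a minimal sketch. It is not the pipeline described in the talk: the `VECTORS` table holds made-up three-dimensional embeddings, and `expand_term`, `top_k`, and `min_sim` are hypothetical names chosen for illustration. In practice the vectors would be trained on the ingested corpus and the expanded terms would be passed to the search engine as additional query terms.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
# Real word vectors would be learned from the document corpus.
VECTORS = {
    "revenue": [0.9, 0.1, 0.0],
    "sales": [0.8, 0.2, 0.1],
    "income": [0.85, 0.05, 0.15],
    "ceo": [0.1, 0.9, 0.0],
    "chairman": [0.15, 0.85, 0.1],
    "factory": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_term(term, vectors=VECTORS, top_k=2, min_sim=0.8):
    """Return the term followed by up to top_k neighbours above min_sim."""
    if term not in vectors:
        # Unknown words are passed through unexpanded.
        return [term]
    scored = [
        (cosine(vectors[term], vec), word)
        for word, vec in vectors.items()
        if word != term
    ]
    scored.sort(reverse=True)  # most similar first
    return [term] + [word for sim, word in scored[:top_k] if sim >= min_sim]


if __name__ == "__main__":
    for query in ("revenue", "ceo", "headcount"):
        print(query, "->", expand_term(query))
```

The similarity threshold matters as much as `top_k`: without it, a term with few close neighbours would be expanded with unrelated words and hurt precision.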
Ashish Kaduskar is a technology lead and architect for the Infosys Information Platform. For the last 4 years his work has focused on scalable data, NLP, and Information Retrieval.
He has over 10 years of experience. His work includes reliable and scalable architecture for parallel processing, enrichment, and indexing of documents and data sources in Spark and Solr; scalable batch and near-real-time NLP pipelines in Spark for content enrichment; integration of SolrCloud with a Spark NLP pipeline for scalable indexing and retrieval; the architecture of an enterprise-scale semantic content processing solution for a large publishing company; storing and processing petabytes of data on 150+ node Hadoop clusters; and building, enhancing, and supporting a highly available ETL layer ingesting terabytes of data for a well-known multinational technology company. He is also experienced with Kafka for high-performance messaging and HBase as a distributed key-value store.
John Kuriakose is a technologist with expertise in language and semantic technology and a strong ability to conceptualize and deliver relevant solutions for clients. He has led multi-disciplinary, multi-continent teams of researchers, architects, and engineers to create technology-driven differentiation, and has led the execution of an organizational strategy around information and analytics by providing the technology primitives for differentiation and customer engagement.
He brings together complex technology architecture definition and value articulation with strong hands-on technology expertise and the ability to mentor engineers. His core technology work has been in Scalable Machine Learning for analytics, distributed computing for complex information processing in Hadoop and Spark, Text Analytics (Natural Language Processing), Scalable Indexing and Store on Graph and RDF Databases, Reasoning and Inference, and Search and Information.
Debanjan Mahata is a Senior Research Associate in the Big Data and Analytics division of Infosys Technologies Ltd.
He holds a doctorate in Information Quality from the University of Arkansas at Little Rock.
He is extremely passionate about big data and its future prospects. His current work revolves around developing machine learning models and data pipelines that enable the extraction and exploration of valuable information hidden in large volumes of unstructured data. He has a deep research interest in understanding the quality of information available in natural language text and in extracting valuable nuggets of information from it.