0 Prerequisite Data Mining, Data Science 02
I Introduction to Big Data Introduction to Big Data, Big Data characteristics, types of Big Data, Traditional vs. Big Data business approach, Big Data Challenges, Examples of Big Data in Real Life, Big Data Applications
Self-learning Topics: Identification of Big Data applications and their solutions. (Refer chapter 1) 03 CO1
II Introduction to Big Data Frameworks What is Hadoop? Core Hadoop Components; Hadoop Ecosystem; Working with Apache Spark
What is NoSQL? NoSQL data architecture patterns: Key-value stores, Graph stores, Column family (Bigtable) stores, Document stores, MongoDB
Self-learning Topics: HDFS vs GFS, MongoDB vs other NoSQL systems, Implementation of Apache Spark. (Refer chapter 2) 06 CO2
III MapReduce Paradigm MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce Execution, Coping With Node Failures. Algorithms Using MapReduce: Matrix-Vector Multiplication by MapReduce, Relational-Algebra Operations, Computing Selections by MapReduce, Computing Projections by MapReduce, Union, Intersection, and Difference by MapReduce, Computing Natural Join by MapReduce, Grouping and Aggregation by MapReduce, Matrix Multiplication, Matrix Multiplication with One MapReduce Step. Illustrating the use of MapReduce with real-life databases and applications.
Self-learning Topics: Implementation of MapReduce algorithms like Word count, Matrix-Vector and Matrix-Matrix multiplication.
(Refer chapter 3) 07 CO3
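As a starting point for the Word count self-learning topic, the map, group-by-key, and reduce phases can be sketched in plain Python on a single machine. This is only an illustration of the data flow, not Hadoop code; the function names `map_task`, `group_by_key`, `reduce_task`, and `word_count` are hypothetical, and the two input strings are made-up sample data.

```python
from collections import defaultdict


def map_task(document):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in document.split()]


def group_by_key(pairs):
    """Grouping by key: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: sum the counts emitted for one word."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of text chunks."""
    pairs = [pair for doc in documents for pair in map_task(doc)]
    return dict(reduce_task(k, v) for k, v in group_by_key(pairs).items())


# Hypothetical sample input: each string stands in for one input chunk.
counts = word_count(["big data is big", "data streams"])
print(counts)
```

In a real cluster the map tasks run in parallel on separate chunks and the framework performs the grouping; here everything runs in one process so the three phases stay visible.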
IV Mining Big Data Streams The Stream Data Model: A Data-Stream-Management System, Examples of Stream Sources, Stream Queries, Issues in Stream Processing. Sampling Data in a Stream: Sampling Techniques. Filtering Streams: The Bloom Filter. Counting Distinct Elements in a Stream: The Count-Distinct Problem, The Flajolet-Martin Algorithm, Combining Estimates, Space Requirements. Counting Ones in a Window: The Cost of Exact Counts, The Datar-Gionis-Indyk-Motwani (DGIM) Algorithm, Query Answering in the DGIM Algorithm.
Self-learning Topics: Streaming services like Apache Kafka/Amazon Kinesis/Google Cloud Dataflow.
Standard Spark Streaming library.
Integration with IoT devices to capture real-time stream data.
(Refer chapter 4) 07 CO4
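For the stream-filtering topic in this module, a Bloom filter can be sketched as a bit array plus a handful of hash functions. The version below is a minimal sketch under stated assumptions: the class name `BloomFilter`, the default sizes, and the trick of salting SHA-256 with an index to get several hash functions are illustrative choices, and the e-mail addresses are made-up sample keys.

```python
import hashlib


class BloomFilter:
    """Bit array with k hash functions; may report false positives,
    but never a false negative for an item that was added."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Assumption: derive k positions by salting SHA-256 with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set every bit that the item hashes to."""
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        """True if all of the item's bits are set (possibly by other items)."""
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical allow-list of stream keys.
allowed = BloomFilter()
for address in ["a@example.com", "b@example.com"]:
    allowed.add(address)

print(allowed.might_contain("a@example.com"))
```

A stream element is passed through only when `might_contain` returns true, so a small fraction of unwanted elements can slip through while no wanted element is ever dropped.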
V Big Data Mining Algorithms Frequent Pattern Mining: Handling Larger Datasets in Main Memory, Basic Algorithm of Park, Chen, and Yu (PCY), The SON Algorithm and MapReduce. Clustering Algorithms: CURE Algorithm, Canopy Clustering, Clustering with MapReduce. Classification Algorithms: Overview, SVM classifiers, Parallel SVM, K-Nearest Neighbor classifications for Big Data, One Nearest Neighbor.
Self-learning Topics: Standard libraries included with Spark like GraphX, MLlib. (Refer chapter 5) 07 CO5
VI Big Data Analytics Applications Link Analysis: PageRank Definition, Structure of the Web, Dead Ends, Using PageRank in a Search Engine, Efficient Computation of PageRank: PageRank Iteration Using MapReduce, Topic-Sensitive PageRank, Link Spam, Hubs and Authorities, HITS Algorithm.
Mining Social-Network Graphs: Social Networks as Graphs, Types, Clustering of Social-Network Graphs, Direct Discovery of Communities, Counting Triangles Using MapReduce.
Recommendation Engines: A Model for Recommendation Systems, Content-Based Recommendations, Collaborative Filtering
Self-learning Topics: Sample applications like social media feeds, multiplayer game interactions, retail industry, financial data analysis. Use cases like location data, real-time stock trades, log monitoring, etc.
(Refer chapter 6) 07 CO6
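The PageRank iteration listed in this module can be illustrated with a small single-machine sketch. It is not a MapReduce implementation: it only shows the update that each iteration computes. The function name `pagerank`, the damping value `beta=0.85`, and the three-page graph `web` are assumptions for illustration, and the sketch assumes every page has at least one out-link (no dead ends) and that every link target is itself a key of the graph.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Power iteration with teleportation.

    links maps each page to the list of pages it links to.
    Assumes no dead ends: every page has at least one out-link.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives the teleport share (1 - beta) / n.
        new_rank = {node: (1.0 - beta) / n for node in nodes}
        for node, targets in links.items():
            # A page splits beta times its rank evenly among its out-links.
            share = beta * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web graph.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
print(ranks)
```

In a MapReduce setting, the inner loop corresponds to map tasks emitting `(target, share)` pairs and reduce tasks summing the shares per target; handling dead ends would need an extra step that this sketch leaves out.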