Logistics

Instructor

Guest Co-Instructor


Content

What is this course about? [Info Handout]

The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on MapReduce and Spark as tools for creating parallel algorithms that can process data at this scale.
Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large-scale Supervised Machine Learning, Data streams, Mining the Web for Structured Data, Web Advertising.

Previous offerings

The previous version of the course was CS345A: Data Mining, which also included a course project. CS345A has since been split into two courses, CS246 and CS341.

You can access class notes and slides of previous versions of the course here:
CS246 Websites: CS246: Spring 2023 / CS246: Winter 2022 / CS246: Spring 2021 / CS246: Winter 2020 / CS246: Winter 2019 / CS246: Winter 2018 / CS246: Winter 2017 / CS246: Winter 2016 / CS246: Winter 2015 / CS246: Winter 2014 / CS246: Winter 2013 / CS246: Winter 2012 / CS246: Winter 2011
CS345a Website: CS345a: Winter 2010

Prerequisites

Students are expected to have the following background:

The recitation sessions in the first weeks of the class will give an overview of the expected background.

Reference Text

The following text is useful, but not required. It can be downloaded for free, or purchased from Cambridge University Press.
Leskovec-Rajaraman-Ullman: Mining of Massive Datasets


Schedule

Lecture slides will be posted here shortly before each lecture. If you wish to view slides further in advance, refer to the 2023 course offering's slides, which are mostly similar.

This schedule is subject to change. All deadlines are at 11:59pm PST.

Date | Description | Suggested Readings | Events | Deadlines
Tue Jan 9 | Introduction; MapReduce and Spark [slides] | | |
Thu Jan 11 | Frequent Itemsets Mining [slides] | | Colab 0, Colab 1, Homework 1 out |
Sat Jan 13 | Recitation: Spark tutorial [Colab] | | |
Tue Jan 16 | Locality-Sensitive Hashing I [slides] | | |
Thu Jan 18 | Locality-Sensitive Hashing II [slides] | | Colab 2 out | Colab 0, Colab 1 due
Thu Jan 18 | Recitation: Linear Algebra [handout] | | |
Fri Jan 19 | Recitation: Probability and Proof Techniques [handout] | | |
Tue Jan 23 | Clustering [slides] | | |
Thu Jan 25 | Dimensionality Reduction [slides] | | Colab 3, Homework 2 out | Colab 2, Homework 1 due
Tue Jan 30 | Recommender Systems I [slides] | | |
Thu Feb 1 | Recommender Systems II [slides] | | Colab 4 out | Colab 3 due
Tue Feb 6 | PageRank [slides] | | |
Thu Feb 8 | Extensions of PageRank to Recommendations and Spam [slides] | | Colab 5, Homework 3 out | Colab 4, Homework 2 due
Tue Feb 13 | Community Detection in Graphs [slides] | | |
Thu Feb 15 | Learning Embeddings [slides] | | Colab 6 out | Colab 5 due
Tue Feb 20 | Graph Representation Learning [slides] | | |
Thu Feb 22 | Graph Neural Networks [slides] | | Colab 7, Homework 4 out | Colab 6, Homework 3 due
Tue Feb 27 | Decision Trees [slides] | | |
Thu Feb 29 | Mining Data Streams I & II [slides] | | Colab 8 out | Colab 7 due
Tue Mar 5 | Computational Advertising [slides] | | |
Thu Mar 7 | Optimizing Submodular Functions [slides] | | Colab 9 out | Colab 8, Homework 4 due
Mon Mar 11 | Exam | | |
Tue Mar 12 | Bandits [slides] | | |
Thu Mar 14 | Scaling ML [slides] | | | Colab 9 due