Offered: Fall, Spring, Summer (online, open to all CMU graduate students)
Prerequisites: Significant programming experience in Python (ability to design, implement and debug non-trivial programs)
Ability to pick up other programming languages as needed
Instructors: Eric Nyberg, Majd Sakr, John Stamper
Links: Canvas, The Project Zone (for enrolled students)

This course provides an introduction to foundational concepts, learning material and applied, hands-on projects related to the three core areas of Data Science: Computing Systems, Analytics and Human-Centered Data Science. Students completing this course will be prepared for applied research and development in the workplace, as well as further graduate study in Data Science or Artificial Intelligence. Students acquire practical skills in solution design (e.g. architecture, framework APIs, cloud computing), analytic algorithms (e.g. classification, clustering, ranking, prediction), interactive analysis (Jupyter and R) and visualization techniques for data analysis, solution optimization and performance measurement on real-world tasks.

It is our goal that students will develop the skills needed to become practitioners or carry out applied research and development projects in the domain of computational data science. Specifically, students are exposed to real-world data and scenarios in which they learn how to:

  • Define analytic requirements and develop appropriate questions to guide the solution design process.

  • Design a data gathering plan that incorporates principles of data governance and sovereignty to ensure usability, integrity, security and availability of data.

  • Use univariate and multivariate graphical and non-graphical techniques to identify trends, patterns and outliers in large datasets.

  • Build and deploy models using the appropriate analytic algorithms (such as linear and logistic regression, k-nearest neighbors, naive Bayes, k-means and hierarchical clustering, among others) to gain understanding from data, make predictions to solve business problems, and inform decision making.

  • Evaluate predictive models by assessing the goodness of fit between a model and data, using model evaluation metrics and cross-validation frameworks.

  • Optimize a model's performance via iterative evaluation of a set of possible approaches and solution pipelines.

  • Select and evaluate data structures and caching approaches for on-demand computation of analytics (e.g. object queries) over large data sets for real-time web deployment, by considering both computation time/cost and storage latency/cost.
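
As a small illustration of the model evaluation objective above, the following is a minimal sketch of k-fold cross-validation written in plain Python. It is not taken from the course materials: the toy two-cluster dataset, the 1-nearest-neighbor classifier, and the function names (`k_fold_indices`, `predict_1nn`, `cross_validate`) are all assumptions chosen to keep the example self-contained.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def predict_1nn(train_x, train_y, query):
    """Return the label of the training point closest to the query."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best]


def cross_validate(xs, ys, k=5, seed=0):
    """Return the accuracy of the 1-NN classifier on each held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct = sum(
            predict_1nn(train_x, train_y, xs[i]) == ys[i] for i in held_out
        )
        scores.append(correct / len(held_out))
    return scores


# Hypothetical toy data: two well-separated 2-D clusters, 50 points each.
rng = random.Random(42)
xs = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
xs += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]
ys = [0] * 50 + [1] * 50

scores = cross_validate(xs, ys, k=5)
mean_accuracy = sum(scores) / len(scores)
print(f"per-fold accuracy: {scores}, mean: {mean_accuracy:.2f}")
```

Each fold is held out exactly once while the remaining folds serve as training data, so the mean of the per-fold scores estimates how the model performs on data it has not seen. In practice a library implementation would likely replace these hand-written helpers.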

We want our students to become independent and resilient problem solvers who are able to overcome challenges while solving open-ended data science problems. The course integrates the use of current best practices and existing toolkits for real-world project scenarios, e.g.:
  1. Problem Representation. You have been hired by a movie streaming service provider and are tasked with building a recommendation feature that can suggest a new movie to each user based on their past ratings. The company provides you with a dataset of their available movies and user ratings. You need to apply your data science skills to understand the nature of this dataset and propose a reasonable recommendation approach.

  2. Domain Analysis and Exploration. You have been hired by a government agency and tasked with performing exploratory data analysis to observe the effects of global warming on food production. The agency provides you with two different datasets on climate change and food production. You need to apply your data science skills to understand the nature of these datasets and propose a reasonable conclusion about the effects of temperature changes on food production, supported by key visuals and graphs.

  3. Domain Data Preparation. The coronavirus pandemic has had huge impacts on the world in the last few months, and so much information about it is generated every day. In order to decide on the business plans for the next quarters, your company has assigned you to construct a corpus of public text data on the topic of coronavirus. You will then analyze this data to gain insights into the public's opinion on the current situation.

  4. Machine Learning and Model Performance. You are working for a liquor company and have been tasked with building a predictive model for wine quality, based on known attributes of different types of wines. This model can help determine the prices of newly developed wines. As a first step, you need to run some experimental models to get a better sense of the task. Is this a classification or regression problem? Which learning algorithms are suitable? What are the associated hyperparameters for each candidate learning algorithm?

  5. Model Deployment and Comparison. Your startup is developing a smart camera application that can tag objects included in a photo. As a first step towards development, your manager would like you to find an existing image database to train the object classification model. You have identified CIFAR-10 as a potential dataset; however, due to budget constraints, you are only provided $100 to deliver the complete prototype, which should include a trained model with reasonable generalizability and an API endpoint for deployment.

  6. Optimization of Model Performance. One of the major phases of any machine learning pipeline is the evaluation and optimization of models based on the results of successive evaluation. In this project, you will use the SQuAD dataset to build a question answering system, utilizing various techniques for data preprocessing and question answering tasks, while understanding the strengths and weaknesses of each technique. This project will help you understand the various perspectives on how to optimize solutions in machine learning as you apply different techniques, as well as the caveats for each technique you consider.

  7. Data Structure Selection and Optimization for Large Data Sets. The growth of the home-sharing and short-term rental markets has presented opportunities and challenges for communities globally. While for some it encourages tourism and provides additional income streams, for others it exacerbates the shortage of affordable housing. Your goal is to create a website that computes and shares the relevant analytic results, by scraping a data set from the AirBnB website and implementing a helper website that provides a responsive, user-friendly interface that computes analytic results on the fly. In this project you will get hands-on experience with different data types and data structures, and will make decisions on which data structure to apply by analyzing the performance differences caused by different approaches. You will explore different caching strategies and understand time differences in read/write operations from and to different data storage implementations.

Each project is accompanied by one or more primers, which elucidate the required background concepts, data, software and cloud computing resources for each project.

Please contact Eric Nyberg if you have any questions about the course.