Data Science Team Companion Course (CSCI 4802/5802)
When: Tuesdays 5-6:15pm
Where: (Fall 2022) ITLL 1B50
Professor: Rafael Frongillo
Captain/TA: Rick Nueve
Office hours: Rick, TBD
Team website: codata.colorado.edu
Communication: Slack
Team signups: Spreadsheet
Assignments/grades: Canvas
Prerequisites: linear algebra or permission of instructor
Course Description
Gives students hands-on experience applying data science techniques and machine learning algorithms to real-world problems. Students will work in small teams on projects of their choosing, which could include competitions sponsored by local companies and organizations. Project teams are responsible for attending, submitting progress reports, and giving short presentations when appropriate.
Motivation
Data science is one of the fastest-growing sectors of our economy, and there is great demand for data scientists with practical experience applying statistical techniques and machine learning algorithms to real data. Several courses in the CS curriculum develop these techniques, in the areas of machine learning, statistical modeling, network science, numerical analysis, and data science more broadly, and these courses often include a hands-on project. However, no course focuses specifically on putting this myriad of tools to work on real data and developing intuition for when to apply certain techniques over others. The present course fills this gap, allowing students to work in teams both small and large to solve real-world prediction challenges and to gain valuable experience, whether they are entering the workforce or remaining in academia.
Topics
To accompany the prediction challenges and other activities hosted by the team, we will have short presentations on topics relevant to the current competition or data science more broadly. A non-exhaustive list of topics is as follows.
- Basic Concepts: classification and regression, prediction vs causation, regularization and overfitting.
- Algorithms: linear regression, logistic regression, support vector machines, boosting, decision trees and forests, neural networks, gradient and stochastic gradient descent.
- Practical Techniques: ensemble methods and aggregation, tradeoffs in regularization, parameter and hyperparameter tuning, data imputation techniques, and cross-validation.
- Software and Tools: tutorials on several modern data science software packages; as of this writing, these include scikit-learn, pandas, Vowpal Wabbit, and XGBoost.
- Context and Industry Practice: via weekly presentations from practicing data scientists, students will learn about techniques actually used in industry and academia, and which algorithms work well for which problems.
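As a small illustration of how a few of the topics above fit together (regularization, hyperparameter tuning, and cross-validation), here is a minimal sketch in plain Python. It is not course material: the synthetic data, the candidate regularization strengths, and the helper names (fit_ridge_1d, k_fold_indices, cross_val_mse) are all assumptions made up for this example. In practice a package such as scikit-learn would handle these steps.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit for y ~ w * x (no intercept):
    w = sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_mse(xs, ys, lam, k=5):
    """Average held-out MSE over k folds for one regularization strength."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        w = fit_ridge_1d(train_x, train_y, lam)
        test_x = [xs[i] for i in held_out]
        test_y = [ys[i] for i in held_out]
        scores.append(mse(w, test_x, test_y))
    return sum(scores) / k

# Hypothetical synthetic data: true slope 2.0 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]

# Hyperparameter tuning: keep the strength with the lowest cross-validated error.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_scores = {lam: cross_val_mse(xs, ys, lam) for lam in candidates}
best_lam = min(cv_scores, key=cv_scores.get)
print("best lambda:", best_lam, "cv mse:", cv_scores[best_lam])
```

The point of the sketch is the workflow rather than the model: each candidate setting is scored only on data it was not fit to, so a very large regularization strength that shrinks the slope toward zero shows up as a clearly worse held-out error.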
Assessment
The general requirement for the course is to do something cool and tell us about it. Typically you will have the option of joining a larger team-wide effort, like a competition or an outreach project, or forming your own small group of 1-3 students to work on a project of your choosing, subject to approval. The specifics of team-wide efforts like competitions or outreach will change from semester to semester. To document your progress, you will submit three written reports detailing and reflecting on what you have done and/or plan to do. Since every student starts from a different place, we will grade you based on how well you document your learning and growth, rather than against a specific target.
For more information about the team, please visit the team website.