CMSC320 Final Project

Summary

In lieu of a final exam, CMSC320 students will turn in a tutorial that will walk users through the entire data science pipeline: data curation, parsing, and management; exploratory data analysis; hypothesis testing and machine learning to provide analysis; and then the curation of a message or messages covering insights learned during the tutorial. Students may choose an application area and dataset(s) that are of interest to them; please feel free to be creative about this!

Remember, the course resources page lists a number of data repositories from which you can download data. You could also create your own dataset using the scraping skills you have learned in class. In fact, we encourage this.

The tutorial should be self-contained as an Rmarkdown document and delivered as a statically hosted GitHub Pages site (described below). You can see examples here:

For R - https://www.kaggle.com/kernels?sortBy=votes&group=everyone&pageSize=20&language=R
For Python - https://www.kaggle.com/kernels?sortBy=votes&group=everyone&pageSize=20&language=Python

In general, the tutorial should contain at least 1500 words of prose and 150 lines of (nonpadded, legitimate) R or Python code, along with appropriate documentation, visualization, and links to any external information that might help the reader.
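As a rough self-check against these minimums, the sketch below counts prose words and code lines in the text of an Rmarkdown source file. It rests on several assumptions that are not part of the assignment: code lives in triple-backtick chunks, blank lines and lines that are only comments do not count as code, and YAML front matter at the top of the file counts as neither prose nor code. The function name count_prose_and_code is hypothetical, and the course staff may count differently.

```python
import re


def count_prose_and_code(rmd_text):
    """Return (prose_words, code_lines) for the text of an Rmarkdown file.

    Assumptions (not from the assignment): code chunks are fenced with
    triple backticks, comment-only and blank lines inside chunks are not
    counted as code, and YAML front matter is ignored entirely.
    """
    prose_words = 0
    code_lines = 0
    in_chunk = False
    in_front_matter = False

    for i, line in enumerate(rmd_text.splitlines()):
        stripped = line.strip()

        # YAML front matter opens with '---' on the first line and
        # closes at the next '---'.
        if stripped == "---" and (i == 0 or in_front_matter):
            in_front_matter = not in_front_matter
            continue
        if in_front_matter:
            continue

        # A line starting with a triple backtick opens or closes a chunk.
        if stripped.startswith("```"):
            in_chunk = not in_chunk
            continue

        if in_chunk:
            # Count only non-blank lines that are not pure comments.
            if stripped and not stripped.startswith("#"):
                code_lines += 1
        else:
            # Count runs of letters, digits, and apostrophes as words.
            prose_words += len(re.findall(r"[A-Za-z0-9']+", stripped))

    return prose_words, code_lines
```

To use it, read your own source file into a string and pass it in, for example count_prose_and_code(open("tutorial.Rmd").read()), where tutorial.Rmd is a placeholder file name. Treat the result as an estimate only.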

GitHub Pages

GitHub offers a service called Pages (https://pages.github.com/) that hosts websites backed by a GitHub-based git repository. We would like you to host your final project on a GitHub Pages project site. To do this, you will need to:

  1. Create a GitHub account (or use one you already have). Below, username stands for that account's username.
  2. Create a git repository titled username.github.io; make sure username is the same as whatever you chose for your global GitHub account.
  3. Create a project within this repository. This is where you’ll dump your Rmarkdown file and an HTML export of that Rmarkdown file.

The deliverable to the CMSC320 staff will then be a single URL pointing to this publicly hosted GitHub Pages-backed website.
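To illustrate what that URL typically looks like, here is a minimal sketch that assembles it from its parts. It assumes the project is a top-level folder inside the username.github.io repository and that the HTML export sits directly in that folder. The function name pages_url and the folder and file names used with it are hypothetical placeholders, not names required by the assignment.

```python
from urllib.parse import quote


def pages_url(username, project, html_file="index.html"):
    """Build the public URL of an HTML file in a project folder.

    Assumes a GitHub Pages user site (repository username.github.io)
    with the project stored as a top-level folder in that repository.
    """
    # Percent-encode each path segment so spaces and similar
    # characters yield a valid URL.
    path = "/".join(quote(part) for part in (project, html_file))
    return f"https://{username}.github.io/{path}"
```

For example, pages_url("username", "final-project", "tutorial.html") gives https://username.github.io/final-project/tutorial.html. Open the URL in a browser where you are not logged in to GitHub to confirm that the page is publicly visible before submitting it.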

Grading

We will assign a numeric score between 1 and 10 for each of the following six criteria:

  1. Motivation. Does the tutorial make the reader believe the topic is relevant or important (i) in general and (ii) with respect to data science?

  2. Understanding. After reading through the tutorial, does an uninformed reader feel informed about the topic? Would a reader who already knew about the topic feel that they learned more about it?

  3. Other resources. Does the tutorial link out to other resources (on the web, in books, etc.) that would give a lagging reader additional help on specific topics, or give an advanced reader the ability to dive more deeply into a specific application area or technique?

  4. Prose. Does the prose portion of the tutorial actually add to the content of the deliverable?

  5. Code. Is the code well written, well documented, reproducible, and does it help the reader understand the tutorial? Does it give good examples of specific techniques?

  6. Subjective evaluation. If somebody linked to this tutorial from, say, Hacker News, would people actually read through the entire thing?

Group Work

Final projects can be prepared in groups of at most three members. On ELMS, each individual in a group will be asked to submit the link to the GitHub Pages site hosting their project, plus a statement about group composition and contributions. Further instructions will be available on the ELMS submission page.