The U-M Department of Statistics is sponsoring a data mining competition open to all U-M undergraduates. Students may participate in the competition either as individuals or as part of a team.
Each participant or team will analyze a data set (described further below) and prepare a report. The reports will be judged by a panel of experts. Prizes will be awarded as follows:
First place $500
Second place $300
Third place $200
When a team is awarded a prize, the prize amount will be divided equally among the team members.
Participants are encouraged to think creatively when exploring the data set. The goal is to identify an interesting, surprising, or insightful finding based on the data. This finding should then be carefully described, interpreted, and justified using quantitative data analysis methods.
All contestants will analyze a data set containing information on more than 100,000 "notable individuals" who lived at any time from antiquity to the modern era. The data set contains the following fields:
Variable | Description |
PrsID | Person-specific identifier |
PrsLabel | Name of the individual |
BYear | Year of birth |
BLocLabel | Birth location |
BLocID | Identifier of birth location |
BLocLat | Latitude of birth location |
BLocLong | Longitude of birth location |
DYear | Year of death |
DLocLabel | Location of death |
DLocID | Identifier of death location |
DLocLat | Latitude of death location |
DLocLong | Longitude of death location |
Gender | The individual's gender |
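As a starting point for exploration, the fields above allow simple derived quantities such as lifespan (DYear minus BYear) and the great-circle distance between the birth and death locations. The following is a minimal sketch in Python using only the standard library. It assumes the data can be read as comma-separated text with the column names listed above, that years and coordinates are plain numbers, and that missing values appear as blank cells; none of this is confirmed by the contest description. The two sample rows are made up for illustration and are not taken from the contest data, and the helper names (to_number, lifespan, haversine_km, migration_km) are hypothetical.

```python
import csv
import io
import math

# Made-up rows for illustration only; NOT taken from the contest data set.
# Assumption: comma-separated text, blank cells mark missing values.
SAMPLE = """PrsID,PrsLabel,BYear,BLocLat,BLocLong,DYear,DLocLat,DLocLong
p1,Person A,1800,0.0,0.0,1870,0.0,90.0
p2,Person B,1900,10.0,20.0,,,
"""


def to_number(text):
    """Convert a cell to float, returning None for a blank cell."""
    text = text.strip()
    return float(text) if text else None


def lifespan(row):
    """Years between birth and death, or None if either year is missing."""
    born, died = to_number(row["BYear"]), to_number(row["DYear"])
    if born is None or died is None:
        return None
    return died - born


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def migration_km(row):
    """Distance from birth location to death location, or None if unknown."""
    coords = [to_number(row[k]) for k in ("BLocLat", "BLocLong", "DLocLat", "DLocLong")]
    if any(c is None for c in coords):
        return None
    return haversine_km(*coords)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    print(row["PrsLabel"], lifespan(row), migration_km(row))
```

With the real file, the io.StringIO(SAMPLE) argument would be replaced by an opened file handle. Rows with missing years or coordinates return None rather than being dropped, so any filtering is left as an explicit choice for the analyst.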
The data set also includes the following indicator variables reflecting the activities for which the person is notable:
Variable | Description |
PerformingArts | Performing arts activities |
Creative | Creative activities |
Gov/Law/Mil/Act/Rel | Activities relating to government, law, the military, activism, or religion |
Academic/Edu/Health | Academic, educational, or health-related activities |
Sports | Sport-related activities |
Business/Industry/Travel | Business, industry, or travel-related activities |
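One simple first look at the indicator variables is to tally how many individuals fall into each activity category and to list individuals flagged in more than one. The sketch below uses only the Python standard library. It assumes the indicators are coded as 1 (notable for that activity) and 0 (not), and that a person may be flagged in several categories; the contest description does not state the coding, so check the actual file first. The sample rows are made up for illustration, and the helper names (is_set, category_counts, multi_category) are hypothetical.

```python
import csv
import io

# Column names as given in the variable table above.
INDICATORS = [
    "PerformingArts",
    "Creative",
    "Gov/Law/Mil/Act/Rel",
    "Academic/Edu/Health",
    "Sports",
    "Business/Industry/Travel",
]

# Made-up rows for illustration only; NOT taken from the contest data set.
# Assumption: indicators are coded as 1 (yes) or 0 (no).
SAMPLE = """PrsID,PerformingArts,Creative,Gov/Law/Mil/Act/Rel,Academic/Edu/Health,Sports,Business/Industry/Travel
p1,1,1,0,0,0,0
p2,0,0,1,0,0,0
p3,0,1,0,1,0,0
"""


def is_set(value):
    """True when an indicator cell holds the assumed 'yes' code of 1."""
    return value.strip() == "1"


def category_counts(rows):
    """Number of individuals flagged in each activity category."""
    counts = {name: 0 for name in INDICATORS}
    for row in rows:
        for name in INDICATORS:
            if is_set(row[name]):
                counts[name] += 1
    return counts


def multi_category(rows):
    """PrsID values of individuals flagged in more than one category."""
    return [
        row["PrsID"]
        for row in rows
        if sum(is_set(row[name]) for name in INDICATORS) > 1
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts = category_counts(rows)
overlap = multi_category(rows)
print(counts)
print(overlap)
```

Tallies like these are only a descriptive starting point; an interesting finding would still need to be interpreted and justified as described in the judging criteria.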
All reports must be submitted by email to Gina Cornacchia (ginalc@umich.edu) by 5 PM on April 17, 2015.
The most important judging criterion is whether the report identifies an interesting finding in the data and supports and interprets it in an engaging and accessible way.
Each participant or team must submit one written report in PDF format.
There is no mandated page length, content or structure for the report. A strong report will be focused and engaging to the reader, and should be readable by someone who is not an expert data scientist or statistician.
Use of advanced or specialized techniques will not necessarily be viewed as a strength. If you choose to use advanced techniques, be sure to motivate and explain each technique in an accessible manner.
Use of visualization (e.g. graphs and diagrams) is encouraged. Visual materials should be incorporated into the report if possible. A separate file containing visual materials will also be accepted.
Questions about the contest should be directed to Gina Cornacchia.