2015 Data Mining Competition
The U-M Department of Statistics is sponsoring a data mining competition open to all U-M undergraduates. Students may participate in the competition either as individuals or as part of a team.
Each participant or team will analyze a data set (described further below) and prepare a report. The reports will be judged by a panel of experts. Prizes will be awarded as follows:
- First place $500
- Second place $300
- Third place $200
When a team is awarded a prize, the prize amount will be divided equally among the team members.
Participants are encouraged to think creatively when exploring the data set. The goal is to identify an interesting, surprising, or insightful finding based on the data. This finding should then be carefully described, interpreted, and justified using quantitative data analysis methods.
The data set
All contestants will analyze a data set containing information about over 100,000 "notable individuals" who lived at any time from antiquity to the modern era. The data set contains the following fields:
Variable     Description
PrsID        Person-specific identifier
PrsLabel     Name of the individual
BYear        Year of birth
BLocLabel    Birth location
BLocID       Identifier of birth location
BLocLat      Latitude of birth location
BLocLong     Longitude of birth location
DYear        Year of death
DLocLabel    Location of death
DLocID       Identifier of death location
DLocLat      Latitude of death location
DLocLong     Longitude of death location
Gender       The individual's gender
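As a starting point for exploring these fields, the following Python sketch derives two quantities from a single record: the lifespan in years and the great-circle distance between the birth and death locations. It is only an illustration under stated assumptions. The announcement does not specify the file format, so the sketch assumes each record is available as a dictionary keyed by the variable names above (for example, from `csv.DictReader`), that years and coordinates are numeric, and that missing values are blank. The helper names `to_float`, `lifespan`, and `migration_km` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def to_float(value):
    """Return value as a float, or None when the field is blank or non-numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def lifespan(record):
    """Years between BYear and DYear, or None if either field is missing.

    Assumption: years are stored as plain numbers on a single scale.
    """
    born = to_float(record.get("BYear"))
    died = to_float(record.get("DYear"))
    if born is None or died is None:
        return None
    return died - born


def migration_km(record):
    """Haversine distance in km from birth location to death location.

    Returns None if any of the four coordinates is missing.
    Assumption: latitudes and longitudes are in decimal degrees.
    """
    keys = ("BLocLat", "BLocLong", "DLocLat", "DLocLong")
    coords = [to_float(record.get(key)) for key in keys]
    if any(c is None for c in coords):
        return None
    lat1, lon1, lat2, lon2 = (math.radians(c) for c in coords)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Returning `None` for incomplete records, rather than a default number, keeps missing data from silently distorting averages. Records for individuals from antiquity may be especially likely to lack years or coordinates, so it is worth checking how many records drop out before drawing conclusions.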
The data set also includes the following indicator variables reflecting the activities for which the person is notable:
Variable                    Description
PerformingArts              Performing arts activities
Creative                    Creative activities
Gov/Law/Mil/Act/Rel         Activities relating to government, military, etc.
Academic/Edu/Health         Academic or educational activities
Sports                      Sport-related activities
Business/Industry/Travel    Business or industry-related activities
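The indicator variables can be summarized with a simple tally. The Python sketch below counts how many records have each indicator set and how many are notable in more than one area. It rests on assumptions the announcement does not confirm: that each record is a dictionary keyed by the variable names above, and that a set indicator is coded as 1 (anything else, including a blank, is treated as not set). The names `ACTIVITY_COLUMNS`, `is_set`, `activity_counts`, and `multi_activity_count` are hypothetical.

```python
from collections import Counter

# Indicator columns as named in the data set description.
ACTIVITY_COLUMNS = [
    "PerformingArts",
    "Creative",
    "Gov/Law/Mil/Act/Rel",
    "Academic/Edu/Health",
    "Sports",
    "Business/Industry/Travel",
]


def is_set(value):
    """True when an indicator equals 1 (assumed 0/1 coding); blanks count as unset."""
    try:
        return float(value) == 1.0
    except (TypeError, ValueError):
        return False


def activity_counts(records):
    """Count, for each indicator column, the records in which it is set."""
    counts = Counter()
    for record in records:
        for column in ACTIVITY_COLUMNS:
            if is_set(record.get(column)):
                counts[column] += 1
    return counts


def multi_activity_count(records):
    """Count records with two or more indicators set."""
    return sum(
        1
        for record in records
        if sum(is_set(record.get(column)) for column in ACTIVITY_COLUMNS) >= 2
    )
```

Because the indicators are not described as mutually exclusive, the per-column counts may sum to more than the number of records, which is why the overlap is counted separately.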
Contest rules
All reports must be submitted by email to Gina Cornacchia ([email protected]) by 5 PM on April 17, 2015.
The most important judging criterion is whether the report identifies an interesting finding in the data and supports and interprets it in an engaging and accessible way.
Each participant or team must submit one written report in PDF format.
There is no mandated page length, content or structure for the report. A strong report will be focused and engaging to the reader, and should be readable by someone who is not an expert data scientist or statistician.
Use of advanced or specialized techniques will not necessarily be viewed as a strength. If you choose to use advanced techniques, be sure to motivate and explain each technique in an accessible manner.
Use of visualization (e.g. graphs and diagrams) is encouraged. Visual materials should be incorporated into the report if possible. A separate file containing visual materials will also be accepted.
Questions about the contest should be directed to Gina Cornacchia.