The U-M Department of Statistics is sponsoring a data mining competition open to all U-M undergraduates. Students may participate in the competition either as individuals or as part of a team.

Each participant or team will analyze a data set (described further below) and prepare a report. The reports will be judged by a panel of experts. Prizes will be awarded as follows:

  • First place $500
  • Second place $300
  • Third place $200

When a team is awarded a prize, the prize amount will be divided equally among the team members.

Participants are encouraged to think creatively when exploring the data set. The goal is to identify an interesting, surprising, or insightful finding based on the data. This finding should then be carefully described, interpreted, and justified using quantitative data analysis methods.

The data set

Link to the data set

All contestants will analyze a data set containing information about over 100,000 "notable individuals" who lived at any time from antiquity to the modern era. The data set contains the following fields:

Variable     Description
PrsID        Person-specific identifier
PrsLabel     Name of the individual
BYear        Year of birth
BLocLabel    Birth location
BLocID       Identifier of birth location
BLocLat      Latitude of birth location
BLocLong     Longitude of birth location
DYear        Year of death
DLocLabel    Location of death
DLocID       Identifier of death location
DLocLat      Latitude of death location
DLocLong     Longitude of death location
Gender       The individual's gender
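As one illustration of how the location fields might be combined, the sketch below computes the great-circle distance between a person's birth and death locations using the haversine formula. It assumes that BLocLat, BLocLong, DLocLat, and DLocLong are stored in decimal degrees, which the field list does not state. The function name haversine_km and the Earth radius constant are illustrative choices, not part of the data set.

```python
import math

# Mean Earth radius in kilometers (an approximation; the Earth is not a perfect sphere).
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees.

    Assumption: the latitude/longitude fields are decimal degrees.
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Sanity checks on points whose distance is known from geometry alone:
same_place = haversine_km(42.0, -83.0, 42.0, -83.0)   # identical points
half_equator = haversine_km(0.0, 0.0, 0.0, 180.0)     # half the equator
```

A distance like this could be compared across eras or activity categories, but records with a missing birth or death location would need to be excluded first.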

The data set also includes the following indicator variables reflecting the activities for which the person is notable:

Variable                    Description
PerformingArts              Performing arts activities
Creative                    Creative activities
Gov/Law/Mil/Act/Rel         Activities relating to government, military, etc.
Academic/Edu/Health         Academic or educational activities
Sports                      Sport-related activities
Business/Industry/Travel    Business or industry-related activities
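A first look at these fields might start along the lines of the sketch below, which uses only the Python standard library. It assumes the data set is delimited text with a header row matching the variable names above, and that the indicator variables are coded 0/1; neither is stated here, so check the actual file. The helper names (read_rows, lifespan, category_counts) are hypothetical, and the three sample rows are made up purely to show the expected shape of the input.

```python
import csv
import io
from collections import Counter

# Indicator columns, named as in the variable list above.
CATEGORY_COLUMNS = [
    "PerformingArts",
    "Creative",
    "Gov/Law/Mil/Act/Rel",
    "Academic/Edu/Health",
    "Sports",
    "Business/Industry/Travel",
]


def read_rows(handle):
    """Read delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(handle))


def lifespan(row):
    """Return DYear minus BYear, or None if either year is missing or non-numeric."""
    try:
        return int(float(row["DYear"])) - int(float(row["BYear"]))
    except (KeyError, TypeError, ValueError):
        return None


def category_counts(rows):
    """Count rows flagged in each category (assumes flags are coded 0/1)."""
    counts = Counter()
    for row in rows:
        for col in CATEGORY_COLUMNS:
            if (row.get(col) or "").strip() in ("1", "1.0"):
                counts[col] += 1
    return counts


# Made-up toy rows, NOT real contest data; they only illustrate the layout.
SAMPLE = """PrsID,PrsLabel,BYear,DYear,PerformingArts,Creative,Gov/Law/Mil/Act/Rel,Academic/Edu/Health,Sports,Business/Industry/Travel
1,Person A,1800,1870,0,1,0,0,0,0
2,Person B,1900,,0,0,0,0,1,0
3,Person C,-50,10,0,1,1,0,0,0
"""

rows = read_rows(io.StringIO(SAMPLE))
lifespans = [lifespan(r) for r in rows]   # None where a year is missing
counts = category_counts(rows)            # a person may fall in several categories
```

To run this on the real file, replace io.StringIO(SAMPLE) with an open file handle. Note that one person can be flagged in more than one category, so the category counts need not sum to the number of rows.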

Contest rules

All reports must be submitted by email to Gina Cornacchia by 5PM on April 17th, 2015.

The most important judging criterion is whether the report identifies an interesting finding in the data and supports and interprets it in an engaging and accessible way.

Each participant or team must submit one written report in PDF format.

There is no mandated page length, content or structure for the report. A strong report will be focused and engaging to the reader, and should be readable by someone who is not an expert data scientist or statistician.

Use of advanced or specialized techniques will not necessarily be viewed as a strength. If you choose to use advanced techniques, be sure to motivate and explain each technique in an accessible manner.

Use of visualization (e.g. graphs and diagrams) is encouraged. Visual materials should be incorporated into the report if possible. A separate file containing visual materials will also be accepted.

Questions about the contest should be directed to Gina Cornacchia.