Joint work with Ilan Lobel and Joshua Reed
Abstract: Supply chain networks are often the most complex systems that modern firms deal with. These systems are composed of external entities whose incentives may not be aligned with the firm's. Such misalignment affects both the firm's performance and the efficiency of the whole system. In this paper, we consider the problem of demand learning and long-term revenue maximization faced by a firm selling through an external sales network. The firm cannot control product experimentation directly and must rely on the decisions made by its sales agents. The salesforce, however, is myopic and thus uninterested in experimentation. We model this system in a continuous-time multi-armed bandit framework and study the problem of a firm that interacts with the bandits through a set of independent myopic agents. We establish novel characterization results that give insight into the steady-state and transient dynamics of this system. These results help quantify the firm's loss due to the myopic behavior of the agents. We show that if the firm adopts a policy of dropping products determined to be suboptimal, it can generate a near-optimal amount of experimentation. This policy distributes the burden of experimentation across the salesforce, thus using the size of the salesforce to speed up learning.