SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts

Arie Cattan, Sophie Johnson, Daniel S WeldIdo DaganIz BeltagyDoug DowneyTom Hope.



Determining coreference of concept mentions across multiple documents is a fundamental task in natural language understanding. Previous work on cross-document coreference resolution (CDCR) typically considers mentions of events in the news, which seldom involve abstract technical concepts that are prevalent in science and technology. These complex concepts take diverse or ambiguous forms and have many hierarchical levels of granularity (e.g., tasks and subtasks), posing challenges for CDCR. We present a new task of Hierarchical CDCR (H-CDCR) with the goal of jointly inferring coreference clusters and hierarchy between them. We create SciCo, an expert-annotated dataset for H-CDCR in scientific papers, 3X larger than the prominent ECB+ resource. We study strong baseline models that we customize for H-CDCR, and highlight challenges for future work. Data and code are available at


title={SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts},
author={Arie Cattan and Sophie Johnson and Daniel S Weld and Ido Dagan and Iz Beltagy and Doug Downey and Tom Hope},
booktitle={3rd Conference on Automated Knowledge Base Construction},