About AAN
Welcome to the All About NLP (AAN) project interface! This website is maintained by Yale University's Language, Information, and Learning at Yale (LILY) Group, which is led by Professor Dragomir R. Radev. AAN encompasses our corpus of resources on NLP and related fields and the research projects which build upon this corpus. You can find out more about this project on our project page.
In the current phase of the AAN project, we have collected around 8,000 surveys, tutorials, and other resources and built a search engine that lets users browse them easily. These resources are intended to help anyone learn about Natural Language Processing (NLP) and related topics and accomplish their NLP goals. We recently introduced this corpus, the TutorialBank Dataset, in our ACL paper TutorialBank: A Manually-Collected Corpus for Prerequisite Chains, Survey Extraction and Resource Recommendation. We annotated the corpus for the tasks of pedagogical function classification, prerequisite chains, and survey extraction, and we are continuing research on each of these tasks. Download the data! Check out our blog post!
If you use the dataset, please acknowledge the creators and use the following BibTeX entry:
@InProceedings{fabbri2018tutorialbank,
author = {Fabbri, Alexander R and Li, Irene and Trairatvorakul, Prawat and He, Yijiao and
Ting, Wei Tai and Tung, Robert and Westerfield, Caitlin and Radev, Dragomir R},
title = {TutorialBank: A Manually-Collected Corpus for Prerequisite Chains, Survey Extraction
and Resource Recommendation},
year = {2018},
booktitle = {Proceedings of ACL},
publisher = {Association for Computational Linguistics}
}
Acknowledgements
The current version is maintained by Yale's LILY lab. Specifically, we would like to thank the following people for their work on this website:
- Pong Trairatvorakul
- Daniel Keller
- Alexander Strzalkowski
- Sydney Young
- Jungo Kasai
- Aaron Pang
- Clark Xie
- Dan Friedman
- Wai Pan Wong
Previous Work
In the previous phases of the AAN project, we created several networks based on 20,000 papers from the ACL Anthology. These networks include paper citation networks, author citation networks, and author collaboration networks. The networks are currently built using only ACL papers that were published by June 2016 and successfully processed. Our AAN search engine also provides access to the ACL Anthology Network corpus.
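As an illustration of how the three kinds of networks relate to one another, the following minimal sketch derives all of them from the same paper records. The record layout, the toy data, and the function name `build_networks` are hypothetical and are not taken from the AAN code base. The sketch also assumes that author self-citations are kept and that citations to papers outside the collection are ignored.

```python
from itertools import combinations

# Hypothetical toy records: paper id -> authors and ids of cited papers.
papers = {
    "P1": {"authors": ["A", "B"], "cites": []},
    "P2": {"authors": ["B", "C"], "cites": ["P1"]},
    "P3": {"authors": ["D"], "cites": ["P1", "P2", "P9"]},  # P9 is not in the collection
}


def build_networks(papers):
    """Return (paper citation, author citation, author collaboration) edge sets."""
    paper_citations = set()   # directed edges: (citing paper, cited paper)
    author_citations = set()  # directed edges: (citing author, cited author)
    collaborations = set()    # undirected edges stored as sorted author pairs

    for paper_id, record in papers.items():
        # Every pair of co-authors on the same paper collaborates.
        for pair in combinations(sorted(set(record["authors"])), 2):
            collaborations.add(pair)

        for cited_id in record["cites"]:
            if cited_id not in papers:
                continue  # assumption: skip citations that leave the collection
            paper_citations.add((paper_id, cited_id))
            # Each citing author cites each author of the cited paper.
            # Assumption: self-citations such as ("B", "B") are kept.
            for citing_author in record["authors"]:
                for cited_author in papers[cited_id]["authors"]:
                    author_citations.add((citing_author, cited_author))

    return paper_citations, author_citations, collaborations


paper_net, author_net, collab_net = build_networks(papers)
```

Storing each network as a set of edges keeps the example short. A real pipeline would likely also track edge weights, such as the number of times one author cites another.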
A number of students from the University of Michigan's CLAIR Group helped create the data, network, and webpages of the original version. The first iteration of the website was created by Mark Thomas Joseph. In addition to him, we would like to thank:
Charles Welch, YoungJoo (Grace) Jeon, Mark Schaller, Mark Joseph, Ben Nash, Bryan Gibson, John Umbaugh, Tunay Gur, Jahna Otterbacher, Arzucan Ozgur, Li Yang, Anthony Fader, Joshua Gerrish, Stephen Hufnagel, Dr. Igor Markov, Nayeoung Kim, Pradeep Muthukrishnan, Amjad Abu-Jbara, Vahed Qazvinian, Paul Hartzog, Chen Huang, Samantha Boylan, Richard Caneba, Rahul Jha, Hyunzoo Chai, Wanchen Lu, Samuel Smolkin, Luke Brandl, Harry Zhang, Jiajun Peng, Jonathan Kummerfeld, Noriyuki Kojima, Yaoyang Lin, Yi Wan, Yichangle Zhao, Yuan Zhuang, Yulin Xie
The previous version of this site was partially supported by the National Science Foundation grant "Collaborative Research: BlogoCenter - Infrastructure for Collecting, Mining and Accessing Blogs", jointly awarded to UCLA and UMich as IIS 0534323 to UMich and IIS 0534784 to UCLA, and by the National Science Foundation grant "iOPENER: A Flexible Framework to Support Rapid Learning in Unfamiliar Research Domains", jointly awarded to UMd and UMich as IIS 0705832.
About the Data
AAN was built from the original PDF files available from the ACL Anthology. We processed these files using open-source OCR technologies, in-house clean-up scripts, and often tedious manual labor, and we developed a web interface that allowed for the annotation of individual references from each paper. We use the following tools for curation:
- PDFBox: We use PDFBox to convert the PDF of the papers to text for further processing.
- ParsCit: We use ParsCit to parse individual references from the text of the publications.
Publications using the AAN data
- Dragomir R. Radev, Pradeep Muthukrishnan, Vahed Qazvinian, Amjad Abu-Jbara. The ACL Anthology Network Corpus. Language Resources and Evaluation Journal, 2013.
- Amjad Abu-Jbara, Jefferson Ezra, and Dragomir R. Radev. Purpose and polarity of citation: Towards NLP-based bibliometrics. In Proceedings of the North American Association for Computational Linguistics, 2013.
- Amjad Abu-Jbara and Dragomir Radev. 2012. Reference Scope Identification in Citing Sentences. The North American Chapter of the Association for Computational Linguistics (NAACL 2012).
- Amjad Abu-Jbara and Dragomir Radev. 2011. Coherent Citation-based Summarization of Scientific Papers. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
- Dragomir R. Radev, Pradeep Muthukrishnan, and Vahed Qazvinian. The ACL anthology network corpus. In Proceedings, ACL Workshop on Natural Language Processing and Information Retrieval for Digital Libraries, Singapore, 2009.
- Saif Mohammad, Bonnie Dorr, Melissa Egan, Ahmed Hassan, Pradeep Muthukrishnan, Vahed Qazvinian, Dragomir R. Radev, and David Zajic. Generating surveys of scientific paradigms. In Proceedings of HLT-NAACL 2009, Boulder, CO, June 2009.
- Vahed Qazvinian and Dragomir R. Radev. The evolution of scientific title networks. In Proceedings of ICWSM 2009 poster session, San Jose, CA, 2009.
- Aaron Elkiss, Siwei Shen, Anthony Fader, Güneş Erkan, David States, and Dragomir Radev. Blind men and elephants: What do citation summaries tell us about a research article? Journal of the American Society for Information Science and Technology, 59(1):51-62, 2008.
- Vahed Qazvinian and Dragomir R. Radev. Scientific paper summarization using citation summary networks. In COLING 2008, Manchester, UK, 2008.
- Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark T. Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir R. Radev, and Yee Fan Tan. The ACL anthology reference corpus: a reference dataset for bibliographic research. In LREC, Marrakesh, Morocco, May 2008.
- Rahul Jha, Amjad Abu-Jbara, Vahed Qazvinian, and Dragomir R. Radev. NLP Driven Citation Analysis for Scientometrics.
- Kathleen McKeown, Hal Daume III, Snigdha Chaturvedi, John Paparrizos, Kapil Thadani, Pablo Barrio, Or Biran, Suvarna Bothe, Michael Collins, Kenneth R. Fleischmann, Luis Gravano, Rahul Jha, Ben King, Kevin McInerney, Taesun Moon, Arvind Neelakantan, Diarmuid O’Seaghdha, Dragomir Radev, Clay Templeton, and Simone Teufel. Predicting the Impact of Scientific Concepts using Full Text Features.
- Rahul Jha, Reed Coke, and Dragomir Radev. Surveyor: A System for Generating Coherent Survey Articles for Scientific Topics.
- Rahul Jha, Catherine Finegan-Dollak, Reed Coke, Ben King, and Dragomir Radev. Content Models for Survey Generation: A Factoid-Based Evaluation.
- Kokil Jaidka, Muthu Kumar Chandrasekaran, Beatriz Fisas Elizalde, Rahul Jha, Christopher Jones, Min-Yen Kan, Ankur Khanna, Diego Molla-Aliod, Dragomir R. Radev, Francesco Ronzano, and Horacio Saggion. The Computational Linguistics Summarization Pilot Task.
- Rahul Jha, Amjad Abu-Jbara, and Dragomir Radev. A System for Summarizing Scientific Topics Starting from Keywords.
- Dragomir Radev and Amjad Abu-Jbara. Rediscovering ACL Discoveries Through the Lens of ACL Anthology Network Citing Sentences.
- Dragomir R. Radev, Mark Thomas Joseph, Bryan Gibson, and Pradeep Muthukrishnan. A Bibliometric and Network Analysis of the Field of Computational Linguistics.
- Aleks Aris, Ben Shneiderman, Vahed Qazvinian, and Dragomir Radev. Visual Overviews for Discovering Key Papers and Influences Across Research Fronts.
Other Related papers
A Note About the PageRank Centrality
Because raw PageRank values are very small, we have adjusted the results to make them more human-readable. Each PageRank value shown on this website is the computed value multiplied by 1,000,000, with the decimal part truncated so that only the integer remains. For example, if a paper has a computed PageRank of 0.003456789, we print it as 3456, dropping the .789. Dividing a displayed number by 1,000,000 therefore recovers the actual value up to the truncated digits.
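The conversion described above can be sketched in a few lines of Python. The function names `display_pagerank` and `approximate_pagerank` are hypothetical and chosen only for this illustration. The sketch shows the arithmetic and is not the site's actual code.

```python
SCALE = 1_000_000  # scaling factor stated in the note above


def display_pagerank(pagerank: float) -> int:
    """Scale a raw PageRank value and drop the decimal part."""
    # int() truncates toward zero, which matches "dropping the .789".
    return int(pagerank * SCALE)


def approximate_pagerank(displayed: int) -> float:
    """Recover the raw value, up to the digits lost by truncation."""
    return displayed / SCALE


shown = display_pagerank(0.003456789)    # 3456
recovered = approximate_pagerank(shown)  # about 0.003456
```

Because the decimal part is dropped rather than rounded, the recovered value can be lower than the computed PageRank by up to one millionth.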