# Mathematical Information Retrieval (MathIR)

As part of the DFG-funded research project GI 1259/1-1

**Methods and Tools to Advance the Retrieval of Mathematical Knowledge from Digital Libraries for Search-, Recommendation- and Assistance-Systems**,

we investigate fundamental methods and tools for making mathematical knowledge accessible to information retrieval tools.

Achieving this goal requires methods that reliably extract mathematical knowledge from documents. In natural language processing (NLP), a number of well-established, general-purpose text processing methods and tools exist that are applied to a text to enable domain-specific extraction tasks. Taking state-of-the-art text processing tools such as the Stanford NLP toolkit as a model, our research will determine how comparable tools for processing mathematical language can be realized.

Our approach builds on Mathematical Language Processing (MLP), a concept whose feasibility we demonstrated at the ACM SIGIR conference in 2016. In this project, we extend our preliminary research to make the approach more effective and applicable to real-world mathematical information retrieval applications. Specifically, the project has the following objectives:

- Identify mathematical formulae and expressions in documents, and reliably differentiate them from similar or neighboring structures.
- Perform type detection and tokenization of mathematical expressions.
- Extract the corresponding mathematical concepts from the tokenized mathematical formulae and expressions.
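
To make the first two objectives more concrete, the following is a minimal sketch in Python. It is not the project's MLP implementation: it assumes formulae are written as inline LaTeX between dollar signs, and the token classes and the function names `find_formulae`, `tokenize`, and `identifiers` are hypothetical choices made for illustration only.

```python
import re
from typing import List, Tuple

# Assumption: formulae appear as inline LaTeX delimited by dollar signs.
INLINE_MATH = re.compile(r"\$(.+?)\$")

# Hypothetical token classes for illustration; not the project's type system.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<command>\\[A-Za-z]+)      # LaTeX commands (backslash followed by letters)
  | (?P<number>\d+(?:\.\d+)?)     # integer or decimal literals
  | (?P<identifier>[A-Za-z])      # single-letter identifiers
  | (?P<operator>[=+\-*/^_<>])    # common operators and script markers
  | (?P<delimiter>[{}()\[\],])    # grouping symbols and separators
    """,
    re.VERBOSE,
)


def find_formulae(text: str) -> List[str]:
    """Return the contents of all inline math spans found in the text."""
    return INLINE_MATH.findall(text)


def tokenize(formula: str) -> List[Tuple[str, str]]:
    """Split a LaTeX formula into (token_type, token_text) pairs."""
    tokens: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(formula):
        if formula[pos].isspace():
            pos += 1  # whitespace carries no meaning in this sketch
            continue
        match = TOKEN_PATTERN.match(formula, pos)
        if match is None:
            # Keep unrecognized characters instead of silently dropping them.
            tokens.append(("unknown", formula[pos]))
            pos += 1
            continue
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def identifiers(tokens: List[Tuple[str, str]]) -> List[str]:
    """Collect distinct identifier tokens in order of first appearance."""
    seen: List[str] = []
    for kind, text in tokens:
        if kind == "identifier" and text not in seen:
            seen.append(text)
    return seen


if __name__ == "__main__":
    sentence = "The energy is given by $E = m c^2$ in this model."
    for formula in find_formulae(sentence):
        print(tokenize(formula))
        print(identifiers(tokenize(formula)))
```

A pattern this simple also shows why the first objective is hard: a sentence such as "it costs $5 or $6" would be misread as containing a formula, so a dependable system needs more context than delimiters alone. The third objective, mapping identifiers to mathematical concepts, is not covered by this sketch.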

Our goal is to enable other scientists to use our methods and tools for mathematical language processing to tackle their own novel problems. We hope that MLP will continue to improve during this process, as was once the case for early NLP approaches.

A wide variety of applications would benefit from advancements to mathematical information retrieval. In the STEM disciplines, improvements could be made to academic literature search, literature recommendation, and even plagiarism prevention. Additionally, expert search or applications in pure mathematics, such as theorem search or definition lookup, would significantly benefit from our developments. Applications beyond STEM fields include the improvement of tutoring assistance tools, as well as patent search and enterprise search, which could become more valuable to companies if they integrate math-aware information retrieval methods.

## RELATED PUBLICATIONS

#### 2020

**Discovering Mathematical Objects of Interest – a Study of Mathematical Notations**

A Greiner-Petter, M Schubotz, F Müller, C Breitinger, HS Cohl, A Aizawa, B Gipp

*Proceedings of the Web Conference 2020 (WWW’20), April 20–24, 2020, Taipei, Taiwan*

DOI: 10.1145/3366423.3380218 Preprint __Core Rank A*__

#### 2019

**Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations**

N Meuschke, V Stange, M Schubotz, M Kramer, B Gipp

*Proceedings of the Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL)*

DOI: 10.1109/JCDL.2019.00026 Preprint __Core Rank A*__

**AnnoMathTeX - a Formula Identifier Annotation Recommender System for STEM Documents**

P Scharpf, I Mackerracher, M Schubotz, J Beel, C Breitinger, B Gipp

*Proceedings of the 13th ACM Conference on Recommender Systems 2019, Copenhagen, Denmark, September 16-20, 2019*

DOI: 10.1145/3298689.3347042 Preprint Bibtex Homepage __Core Rank B__

**Why Machines Cannot Learn Mathematics, Yet**

A Greiner-Petter, T Ruas, M Schubotz, A Aizawa, W Grosky, B Gipp

*4th Joint Workshop on Bibliometric-Enhanced Information Retrieval and Natural Language Processing for Digital Libraries co-located with the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval*

PDF

**Semantic Preserving Bijective Mappings for Expressions Involving Special Functions in Computer Algebra Systems and Document Preparation Systems**

A Greiner-Petter, M Schubotz, HS Cohl, B Gipp

*Aslib Journal of Information Management*

DOI: 10.1108/AJIM-08-2018-0185 Preprint Bibtex

**Towards Formula Concept Discovery and Recognition**

P Scharpf, M Schubotz, HS Cohl, B Gipp

*Proceedings of the 4th Joint Workshop on Bibliometric-Enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, July 25, 2019.*

PDF Preprint Bibtex

**Forms of Plagiarism in Digital Mathematical Libraries**

M Schubotz, O Teschke, V Stange, N Meuschke, B Gipp

*Proceedings of the International Conference on Intelligent Computer Mathematics*

DOI: 10.1007/978-3-030-23250-4_18 Preprint

#### 2018

**Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context**

M Schubotz, A Greiner-Petter, P Scharpf, N Meuschke, HS Cohl, B Gipp

*Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries (JCDL)*

DOI: 10.1145/3197026.3197058 Preprint Bibtex __Core Rank A*__

**HyPlag: A Hybrid Approach to Academic Plagiarism Detection**

N Meuschke, V Stange, M Schubotz, B Gipp

*Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval*

DOI: 10.1145/3209978.3210177 Preprint Bibtex __Core Rank A*__

**Automated Symbolic and Numerical Testing of DLMF Formulae Using Computer Algebra Systems**

HS Cohl, A Greiner-Petter, M Schubotz

*Intelligent Computer Mathematics – 11th International Conference, CICM 2018, Hagenberg, Austria, August 13-17, 2018, Proceedings*

DOI: 10.1007/978-3-319-96812-4_4 Bibtex

**MathTools: An open API for convenient MathML handling**

A Greiner-Petter, M Schubotz, HS Cohl, B Gipp

*Intelligent Computer Mathematics – 11th International Conference, CICM 2018, Hagenberg, Austria, August 13-17, 2018, Proceedings*

DOI: 10.1007/978-3-319-96812-4_9 Bibtex

**Towards Formula Translation Using Recursive Neural Networks**

F Petersen, M Schubotz, B Gipp

*Proceedings of the 11th Conference on Intelligent Computer Mathematics (CICM)*

PDF Preprint Bibtex

**Representing Mathematical Formulae in Content MathML Using Wikidata**

P Scharpf, M Schubotz, B Gipp

*Proceedings of the 3rd Joint Workshop on Bibliometric-Enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2018) co-located with the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018), Ann Arbor, USA, July 12, 2018.*

PDF Preprint Bibtex

**Generating OpenMath Content Dictionaries from Wikidata**

M Schubotz

*Joint Proceedings of the CME-EI, FMM, CAAT, FVPS, M3SRD, OpenMath Workshops, Doctoral Program and Work in Progress at the Conference on Intelligent Computer Mathematics 2018 co-located with the 11th Conference on Intelligent Computer Mathematics (CICM 2018)*

DOI: 10.5281/zenodo.1409946 Preprint

**Mathematische Formeln in Wikipedia**

M Schubotz

*Beiträge zum Mathematikunterricht 2018*

DOI: 10.17877/de290r-19676 Preprint Bibtex

**Introducing MathQA – a Math-Aware Question Answering System**

M Schubotz, P Scharpf, K Dudhat, Y Nagar, F Hamborg, B Gipp

*Proceedings of the Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL), Workshop on Knowledge Discovery*

DOI: 10.1108/IDD-06-2018-0022 Preprint Bibtex