As part of the DFG-funded research project GI 1259/1-1, "Methods and Tools to Advance the Retrieval of Mathematical Knowledge from Digital Libraries for Search-, Recommendation- and Assistance-Systems", we investigate fundamental methods and tools for making mathematical knowledge accessible to information retrieval systems.
Achieving this goal requires methods to reliably extract mathematical knowledge from documents. In the domain of natural language processing (NLP), a number of well-established, general-purpose text processing methods and tools exist that can be applied to a text to enable domain-specific extraction tasks. Taking state-of-the-art text processing tools, such as the Stanford NLP toolkit, as a model, our research will determine how comparable tools for processing mathematical language can be realized.
Our approach expands upon Mathematical Language Processing (MLP), a concept whose feasibility we demonstrated at the ACM SIGIR conference in 2016 (link to paper). In this project, we build on that preliminary research to make the approach more effective and applicable to real-world mathematical information retrieval applications. Specifically, the project has the following objectives:
- Identify mathematical formulae and expressions in documents, and reliably differentiate them from similar or neighboring structures.
- Perform type detection and tokenization of mathematical expressions.
- Extract the corresponding mathematical concepts from the tokenized mathematical formulae and expressions.
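
As a rough illustration of how these three objectives fit together, the following Python sketch walks through them on a toy sentence. It assumes that formulae appear as inline LaTeX delimited by single dollar signs and that definitions follow the pattern "$x$ is the ...". All function names and patterns are hypothetical and do not reflect the project's actual methods or tools.

```python
import re

# Assumption: formulae are inline LaTeX between unescaped single dollar signs.
FORMULA_RE = re.compile(r"(?<!\\)\$(.+?)(?<!\\)\$")

# Assumption: a very small token inventory; real mathematical notation needs far more.
TOKEN_RE = re.compile(
    r"(?P<command>\\[A-Za-z]+)"
    r"|(?P<number>\d+(?:\.\d+)?)"
    r"|(?P<identifier>[A-Za-z])"
    r"|(?P<operator>[=+\-*/^_<>])"
    r"|(?P<bracket>[(){}\[\]])"
)

# Toy heuristic (hypothetical): "$x$ is the <phrase>" ended by ",", " and " or ".".
DEFINITION_RE = re.compile(
    r"\$([A-Za-z])\$ (?:is|denotes) (?:the |a |an )?([a-z ]+?)(?=,| and |\.)"
)


def find_formulae(text):
    """Objective 1: locate formula spans in running text."""
    return [match.group(1).strip() for match in FORMULA_RE.finditer(text)]


def tokenize(formula):
    """Objective 2: split a formula into (type, token) pairs."""
    return [(match.lastgroup, match.group()) for match in TOKEN_RE.finditer(formula)]


def guess_concepts(text):
    """Objective 3: map identifiers to the phrases that appear to define them."""
    return {ident: phrase for ident, phrase in DEFINITION_RE.findall(text)}


sentence = (
    "The relation $E = m c^2$ holds, where $E$ is the energy, "
    "$m$ is the mass and $c$ is the speed of light."
)
formulae = find_formulae(sentence)
tokens = tokenize(formulae[0])
concepts = guess_concepts(sentence)
```

Pattern matching of this kind only covers the simplest cases. Reliably separating formulae from neighboring structures and resolving ambiguous identifiers is exactly where more robust methods are needed.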
Our goal is to enable other scientists to use our methods and tools for mathematical language processing to tackle their own novel problems. We hope that MLP will continue to improve in the process, as early NLP approaches once did.
A wide variety of applications would benefit from advances in mathematical information retrieval. In the STEM disciplines, improvements could be made to academic literature search, literature recommendation, and even plagiarism prevention. Expert search and applications in pure mathematics, such as theorem search or definition lookup, would also benefit significantly from our developments. Applications beyond the STEM fields include tutoring assistance tools, as well as patent search and enterprise search, which could become more valuable to companies if they integrate math-aware information retrieval methods.