Ticket #1 (accepted enhancement)
Semantic Purification of arXMLiv Mathematical Fragments
| Reported by: | deyan | Owned by: | deyan |
|---|---|---|---|
| Priority: | minor | Milestone: | Architecture Deployment |
| Component: | Preprocessing | Version: | 1.0 |
| Keywords: | | Cc: | catalin, sanca, cjucovschi, mgrigore |
| Blocked By: | | Blocking: | |
| Due to close: | | Include in GanttChart: | no |
| Dependencies: | | Due to assign: | YYYY/MM/DD |
Description (last modified by deyan)
This project aims to purify the mathematical modularity of the documents in the arXMLiv corpus, so that mathematical content resides in math mode and natural language in text mode.
Outline
- Identify mathematical symbols in text mode and transfer them to math mode, in the form of XMath elements
- Identify natural language segments used in math mode and transfer them to text mode (when void of math semantics)
- Perform a heuristic search for complex tokens on the basic symbol tokens in the XMath math blocks. Use WordNet and corpus statistics as a reference dictionary.
- Merge adjacent math blocks, enlarging the context for formula-level tools and cleaning up after the modularity purification.
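The steps above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the LaMaPUn implementation: documents are reduced to a list of `(mode, content)` pairs instead of XMath elements, `KNOWN_WORDS` and `COMPLEX_TOKENS` are hypothetical stand-ins for the WordNet and corpus-statistics dictionary, and `looks_like_math` is an invented heuristic.

```python
import re

# Hypothetical stand-ins for the WordNet / corpus-statistics dictionary.
KNOWN_WORDS = {"if", "and", "then", "for", "all", "where", "otherwise"}
COMPLEX_TOKENS = {"sin", "cos", "log", "lim"}


def looks_like_math(token):
    """Toy heuristic (an assumption): operators, digits, or a lone letter."""
    if re.search(r"[=<>+^_\\]", token) or token.isdigit():
        return True
    return len(token) == 1 and token.isalpha() and token not in {"a", "A", "I"}


def purify(segments):
    """Reassign each token to text or math mode, merging adjacent blocks.

    segments: list of (mode, content) pairs with mode 'text' or 'math'.
    """
    out = []
    for mode, content in segments:
        for token in content.split():
            if mode == "text" and looks_like_math(token):
                new_mode = "math"   # math symbol found in text mode
            elif mode == "math" and token.lower() in KNOWN_WORDS:
                new_mode = "text"   # natural language found in math mode
            else:
                new_mode = mode
            if out and out[-1][0] == new_mode:
                # Adjacent blocks of the same mode are merged into one.
                out[-1] = (new_mode, out[-1][1] + " " + token)
            else:
                out.append((new_mode, token))
    return out


def join_complex_tokens(tokens, dictionary=COMPLEX_TOKENS):
    """Greedy longest-match join of adjacent basic tokens into known words."""
    result, i = [], 0
    while i < len(tokens):
        # Try the longest run first; a run must span at least two tokens.
        for j in range(len(tokens), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate.lower() in dictionary:
                result.append(candidate)
                i = j
                break
        else:
            result.append(tokens[i])
            i += 1
    return result
```

For example, `purify([("text", "We have x ="), ("math", "y + 1 for all y")])` yields `[("text", "We have"), ("math", "x = y + 1"), ("text", "for all"), ("math", "y")]`, and `join_complex_tokens(list("2sinx"))` yields `["2", "sin", "x"]`. A real dictionary lookup would need to weigh false positives, since a run of single-letter tokens can also be a genuine product of variables.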
For a complete project description, please refer to the attached project proposal and the project report for the v1.0 release.
Implementation
This project has been integrated with the OOPerl LaMaPUn library as module LaMaPUn::Preprocessor::Purify.
New potential improvements
- Purify punctuation erroneously left at the very end of math blocks.
- Formalize trailing "Reference" sections to <bibliography>.
- Formalize leading section-less text to <abstract> when unambiguous.
- Purify NL spaces back to math, e.g. when they break a bracketing scope, as in $(a,b,$ $c,d)$.
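Two of these improvements, trailing punctuation and broken bracketing scopes, might look roughly as follows. This is a hedged sketch only: it reuses a hypothetical `(mode, content)` segment list rather than the module's actual data structures, and the function names are invented for illustration.

```python
import re

# One or more punctuation marks at the very end of a math block.
TRAILING_PUNCT = re.compile(r"^(.*?)\s*([.,;:!?]+)\s*$", re.S)
OPENERS, CLOSERS = "([{", ")]}"


def bracket_depth(content):
    """Number of brackets opened but not yet closed in content."""
    depth = 0
    for ch in content:
        if ch in OPENERS:
            depth += 1
        elif ch in CLOSERS:
            depth -= 1
    return depth


def merge_broken_scopes(segments):
    """Pull whitespace-only text and following math into an open scope."""
    out = []
    for mode, content in segments:
        open_scope = (bool(out) and out[-1][0] == "math"
                      and bracket_depth(out[-1][1]) > 0)
        if open_scope and (mode == "math" or content.strip() == ""):
            # The previous math block has an unclosed bracket: extend it.
            out[-1] = ("math", out[-1][1] + content)
        else:
            out.append((mode, content))
    return out


def strip_trailing_punctuation(segments):
    """Move punctuation at the end of a math block into a text segment."""
    out = []
    for mode, content in segments:
        m = TRAILING_PUNCT.match(content) if mode == "math" else None
        if m and m.group(1):
            out.append(("math", m.group(1)))
            out.append(("text", m.group(2)))
        else:
            out.append((mode, content))
    return out
```

Scopes are merged before punctuation is stripped, so the trailing comma of `$(a,b,$` is not mistaken for sentence punctuation. With that order, `[("math", "(a,b,"), ("text", " "), ("math", "c,d).")]` becomes `[("math", "(a,b, c,d)"), ("text", ".")]`.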

