Ticket #1 (accepted enhancement)

Opened 3 years ago

Last modified 2 months ago

Semantic Purification of arXMLiv Mathematical Fragments

Reported by: deyan Owned by: deyan
Priority: minor Milestone: Architecture Deployment
Component: Preprocessing Version: 1.0
Keywords: Cc: catalin, sanca, cjucovschi, mgrigore
Blocked By: Blocking:
Due to close: Include in GanttChart: no
Dependencies: Due to assign: YYYY/MM/DD

Description (last modified by deyan) (diff)

This project aims at purifying the mathematical modularity of the documents in the arXMLiv corpus.

Outline


  • Identify mathematical symbols in text mode and transfer them to math mode, in the form of XMath elements
  • Identify natural language segments used in math mode and transfer them to text mode (when void of math semantics)
  • Perform a heuristic search for complex tokens on the basic symbol tokens in the XMath math blocks. Use WordNet and corpus statistics as a reference dictionary.
  • Merge adjacent math blocks, enlarging the context for formula-level tools and cleaning up after the modularity purification.

For a complete project description, please refer to the attached project proposal and the project report for the v1.0 release.

Implementation


This project has been integrated with the OOPerl LaMaPUn library as module LaMaPUn::Preprocessor::Purify.

New potential improvements


  • Purify punctuation erroneously left inside the very end of Math blocks.
  • Formalize trailing "Reference" sections to <bibliography>
  • Formalize leading section-less text to <abstract> when unambiguous.
  • Purify NL spaces back to math, e.g. when they break a bracketing scope: $(a,b,$ $c,d)$
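The first item above (punctuation left at the very end of a math block) can be sketched on raw TeX-like strings. This is only an illustration in Python, not the LaMaPUn code, which is Perl and works on the XML; the function name is hypothetical, and it assumes inline math delimited by single dollar signs.

```python
import re

# Assumption: inline math is delimited by single, properly paired dollar signs.
MATH = re.compile(r"\$([^$]+)\$")

def purify_trailing_punctuation(text):
    """Move sentence punctuation at the very end of a math block into text mode."""
    def fix(match):
        body = match.group(1).rstrip()
        # Skip TeX spacing macros such as \, \; \: which merely look like punctuation.
        if len(body) > 1 and body[-1] in ".,;:" and body[-2] != "\\":
            return "$" + body[:-1].rstrip() + "$" + body[-1]
        return match.group(0)
    return MATH.sub(fix, text)

print(purify_trailing_punctuation(r"We have $a+b=c.$ Next"))
```

Scanning whole `$...$` pairs from left to right keeps the opening and closing delimiters paired, which a single pattern anchored on the closing dollar sign would not guarantee.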

Attachments

semantic_purificiation_proposal.pdf Download (103.1 KB) - added by deyan 3 years ago.
Official project description
semantic_purificaion_final.pdf Download (82.6 KB) - added by deyan 3 years ago.
Project report for the v1.0 implementation of the purification

Change History

Changed 3 years ago by deyan

  • status changed from new to assigned

Changed 3 years ago by sanca

Please provide what was to follow :)

Changed 3 years ago by deyan

Official project description

Changed 3 years ago by deyan

  • description modified (diff)

Changed 3 years ago by deyan

  • type changed from enhancement to task
  • milestone set to Knowledge Representation Seminar Spring 09

Changed 3 years ago by deyan

  • status changed from assigned to accepted

Changed 3 years ago by deyan

  • cc catalin, sanca, cjucovschi, mgrigore added

Changed 3 years ago by deyan

  • priority changed from critical to minor
  • version set to 1.0
  • type changed from task to enhancement
  • due_close 2009/05/30 deleted

I am leaving the ticket open, in case ideas of future improvements come to mind. The official development of this project is over, but it is yet to be properly included as an architecture preprocessor module.

That should be taken care of by the time we get a server running, which should be September-October 09.

Changed 3 years ago by sanca

why is this light blue? :)

Changed 3 years ago by deyan

New potential improvements:

  • Purify punctuation erroneously left inside the very end of Math blocks.
  • Make the different purification strategies optional on input to achieve higher versatility of the module.
  • Refactor into OO Perl.

Changed 3 years ago by sanca

Please explain the motivation for refactoring to OO Perl. As far as I know, Perl is not really made to be OO. But of course, I don't know more than just the basics.

Changed 3 years ago by deyan

@OO vs classic Perl: Same reasoning as with C vs C++. OO Perl gives you modularity => makes the software scalable and reusable. And of course easily extensible. So it's a general argument in favor of OO approaches for long term development.

Perl was not originally intended to be OO, yes, but Perl 6 will take care of that in the mid-future and the current solutions are also good enough. Prominent example: LaTeXML is a gigantic OO Perl masterpiece.

Changed 3 years ago by sanca

:) masterpiece indeed (:

I would assume that you foresee some sort of abstract purification class, that each bit of purification functionality derives and implements, so that they all have a common interface?

And then maybe create an array of these purificator objects and run them in series on a single document?

Changed 3 years ago by deyan

I prefer the LaTeXML approach to this interfacing. Namely, have a single object/instance of a purificator that is doing all the work but make it recognize a wide range of options on input, so that you can run it for a myriad of purposes. This has proven to be extremely useful and convenient with LaTeXML and is something I already understand and am willing to adopt in the long run.

Changed 3 years ago by sanca

I don't understand then. This doesn't sound like an OOP approach to me, just like one program with a lot of different functionalities. Please explain the OO part in this.

Changed 3 years ago by deyan

Yes, it is one program with a lot of different functionalities; the OO part is under the hood and is important only for future developers. Basically, all functionality of the preprocessor will be wrapped by a "LaMaPUn::Preprocessor" module and the different functionalities would be separate static functions. However, if the need arises to introduce more depth to some part of the processing, one can immediately extend the module (even externally!) with an additional LaMaPUn::Preprocessor::Foo submodule, which will then encapsulate the required processing for that particular functionality.

Above, I was explaining the user-side interfacing of what you would be accessing. Instead of "an array of purificator objects" you would have a single (or no) object keeping the internal data and invoking the respective parts based on the user's preferences. It's like a driver program that is hitting the right keys on the piano in order to get the right tune. I basically want to make this piano instead of forcing whoever uses the purificator to whistle a hard-coded melody.
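A minimal sketch of the "single object, many options" interface described above, written in Python purely for illustration (the actual module is Perl). The class name, the strategy names and the placeholder bodies are all hypothetical.

```python
class Purifier:
    """One driver object; each strategy is switched on or off by an option."""

    # Hypothetical strategy names, listed in the order they would run.
    STRATEGIES = ("text_to_math", "math_to_text", "merge_math")

    def __init__(self, **options):
        unknown = set(options) - set(self.STRATEGIES)
        if unknown:
            raise ValueError("unknown options: %s" % sorted(unknown))
        # Every strategy is enabled unless the caller disables it explicitly.
        self.options = {name: options.get(name, True) for name in self.STRATEGIES}
        self.log = []  # internal data shared by all strategies

    def purify(self, document):
        for name in self.STRATEGIES:
            if self.options[name]:
                document = getattr(self, "_" + name)(document)
                self.log.append(name)
        return document

    # Placeholder bodies: real strategies would transform the XML document.
    def _text_to_math(self, document):
        return document

    def _math_to_text(self, document):
        return document

    def _merge_math(self, document):
        return document


purifier = Purifier(merge_math=False)
purifier.purify("<document/>")
print(purifier.log)
```

The caller picks which "keys" to hit through the options; adding a strategy means adding one name and one method, without changing how the object is invoked.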

Pardon my metaphors :)

Changed 3 years ago by deyan

  • description modified (diff)

Changed 3 years ago by deyan

Project report for the v1.0 implementation of the purification

Changed 3 years ago by deyan

  • description modified (diff)

Changed 3 years ago by deyan

  • description modified (diff)
  • milestone changed from Knowledge Representation Seminar Spring 09 to Architecture Deployment

Changed 2 years ago by deyan

A note here so that it doesn't slip my mind:

  • Purification should also abolish equations with no math inside, after NL has been removed from math mode. A shocking example was:
      $$ ACKNOWLEDGMENTS $$
    
    Ideally, this should also preserve the italics (or whichever) font, implied by the $$ $$.

Changed 2 years ago by sanca

Well, hello again 6 months later :)

Shocking example, indeed ... do you have a part of the module checking for this sort of false math at preprocessing?
What is the status of the project now?

Changed 2 years ago by deyan

Yes, v1.0 already detects the natural language and removes the <Math> wrapper. In the example above, $$ invokes a display math mode, which induces an additional <equation> wrapper. The result from purification looks like:

 <equation>
  <text>ACKNOWLEDGMENTS</text>
 </equation>

I only need a small upgrade that unwraps <equation>s with no <Math> inside.
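The unwrapping step could look roughly like the following sketch. It uses Python's standard `xml.etree.ElementTree` for illustration rather than the Perl code of the module, assumes element names without an XML namespace, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def unwrap_mathless_equations(root):
    """Replace every <equation> lacking a <Math> descendant by its own children."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for eq in list(root.iter("equation")):
        children = list(eq)
        if eq.find(".//Math") is not None or eq not in parents or not children:
            continue
        parent = parents[eq]
        index = list(parent).index(eq)
        # Keep the text that followed </equation> attached to the last child.
        if eq.tail:
            children[-1].tail = (children[-1].tail or "") + eq.tail
        parent.remove(eq)
        for offset, child in enumerate(children):
            parent.insert(index + offset, child)
            parents[child] = parent
    return root

doc = ET.fromstring(
    "<para><equation><text>ACKNOWLEDGMENTS</text></equation>"
    "<equation><Math/></equation></para>"
)
unwrap_mathless_equations(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Text placed directly inside the `<equation>` (usually indentation only) is dropped in this sketch, and an `<equation>` that is the document root or has no child elements is left alone.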

The project is now in a maintenance stage: I am upgrading it as problems occur while developing the entire architecture pipeline. See the updated ticket description for more info.

Changed 23 months ago by deyan

  • description modified (diff)

Added one more needed purification feature to description.

Changed 23 months ago by deyan

Sloppy modality when explaining a proof:

1) $\Leftrightarrow$ 2)

Probably not possible to find a general treatment, though a heuristic rule could work in most cases. E.g. move the tokens on the left and right to math mode, until a space or end of line is met.
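One way the heuristic rule might be sketched, here in Python on the TeX source for illustration. To avoid swallowing ordinary words, the sketch adds an assumption that is not in the note above: only math blocks consisting of a single logical connective are treated. The connective list and the function name are hypothetical.

```python
import re

# Assumption: only a math block holding one lone connective triggers the rule.
CONNECTIVES = r"\\(?:Leftrightarrow|Rightarrow|Leftarrow|iff|implies)"
PATTERN = re.compile(r"([^\s$]+) \$\s*(" + CONNECTIVES + r")\s*\$ ([^\s$]+)")

def absorb_neighbours(line):
    """Pull the tokens left and right of a lone connective into its math block."""
    return PATTERN.sub(r"$\1 \2 \3$", line)

print(absorb_neighbours(r"1) $\Leftrightarrow$ 2)"))
```

Each neighbour extends up to the nearest space or the end of the line, as suggested above; neighbours that are themselves math blocks are excluded by the `[^\s$]` character class.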

Changed 23 months ago by deyan

Using math mode to typeset \item label:

\item[$\mathrm{(a)}$] 

These can be purified in general, due to the \mathrm hint.
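A small sketch of this case in Python, for illustration only. It assumes the label is exactly one `\mathrm{...}` group with no nested braces inside the math block; the function name is hypothetical.

```python
import re

# Assumption: the whole label is a single \mathrm{...} group without nested braces.
ITEM_LABEL = re.compile(r"\\item\[\$\\mathrm\{([^{}$]*)\}\$\]")

def purify_item_labels(tex):
    r"""Rewrite \item[$\mathrm{(a)}$] as \item[(a)]."""
    # A function is used as replacement so that "\i" is not read as a regex escape.
    return ITEM_LABEL.sub(lambda m: "\\item[" + m.group(1) + "]", tex)

print(purify_item_labels(r"\item[$\mathrm{(a)}$] First case"))
```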

Changed 22 months ago by deyan

Make some effort to gain bracket context. E.g. merge into one formula:

$Math1$ ( $Math2$ )

An example from my case study:

$S_i^{(m)} H_\mu$ ( $\in H_\mu$ )

Note that if spaces exist, it is semantically meaningful for them to be preserved when merged into a single formula. A parser can then figure out that there is no relation between the two parts and utilize the context properly.

In the end, we should get an invisible reference to $Math1$ as a first argument to the $\in$, and interpret the parentheses as syntactic fences, when back to an XML content representation. This promises to be quite non-trivial.
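The merging step alone (not the later parsing) might be sketched as follows, in Python on the TeX source for illustration. The pattern only covers round brackets that directly enclose a single math block, and the function name is hypothetical.

```python
import re

# $Math1$ ( $Math2$ ) with optional spaces, which are captured so they survive.
PATTERN = re.compile(r"\$([^$]+)\$( *)\(( *)\$([^$]+)\$( *)\)")

def merge_bracketed(text):
    """Merge '$Math1$ ( $Math2$ )' into one formula, keeping the original spaces."""
    return PATTERN.sub(r"$\1\2(\3\4\5)$", text)

print(merge_bracketed(r"$S_i^{(m)} H_\mu$ ( $\in H_\mu$ )"))
```

The spaces around the parentheses are kept inside the merged formula, so a later parser still sees that the two parts were written separately.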

Changed 13 months ago by deyan

One more case that needs to be purified successfully:

  • Purify <td> elements
  • Merge math expressions that are separated only by a relation operator, such as = < >
  • Example:
    <td align="center" border="l r t" thead="true">
          <Math mode="inline" tex="P_{{m+2}}^{m}" xml:id="S8.T12.m2">
                <XMath>
                      <XMTok role="UNKNOWN" font="italic">P</XMTok>
                      <XMApp role="POSTSUBSCRIPT" scriptpos="9">
                            <XMArg rule="Subscript">
                                  <XMTok role="UNKNOWN" font="italic">m</XMTok>
                                  <XMTok meaning="plus" role="ADDOP">+</XMTok>
                                  <XMTok meaning="2" role="NUMBER">2</XMTok>
                            </XMArg>
                      </XMApp>
                      <XMApp role="POSTSUPERSCRIPT" scriptpos="9">
                            <XMArg rule="Superscript">
                                  <XMTok role="UNKNOWN" font="italic">m</XMTok>
                            </XMArg>
                      </XMApp>
                </XMath>
          </Math>
          =<Math mode="inline" tex="NHM_{{m+2}}^{m}" xml:id="S8.T12.m3">
                <XMath>
                      <XMTok role="UNKNOWN" font="italic">N</XMTok>
                      <XMTok role="UNKNOWN" font="italic">H</XMTok>
                      <XMTok role="UNKNOWN" font="italic">M</XMTok>
                      <XMApp role="POSTSUBSCRIPT" scriptpos="9">
                            <XMArg rule="Subscript">
                                  <XMTok role="UNKNOWN" font="italic">m</XMTok>
                                  <XMTok meaning="plus" role="ADDOP">+</XMTok>
                                  <XMTok meaning="2" role="NUMBER">2</XMTok>
                            </XMArg>
                      </XMApp>
                      <XMApp role="POSTSUPERSCRIPT" scriptpos="9">
                            <XMArg rule="Superscript">
                                  <XMTok role="UNKNOWN" font="italic">m</XMTok>
                            </XMArg>
                      </XMApp>
                </XMath>
          </Math>
    </td>
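
A rough sketch of how the merge in the example above could work, using Python's `xml.etree.ElementTree` for illustration rather than the module's Perl code. It assumes un-namespaced element names; the function name and the `role="RELOP"` value given to the new token are assumptions, not taken from this ticket.

```python
import xml.etree.ElementTree as ET

RELATIONS = {"=", "<", ">"}

def merge_across_relation(parent):
    """Merge sibling <Math> elements whose only separator is a relation sign."""
    children = list(parent)
    i = 0
    while i + 1 < len(children):
        left, right = children[i], children[i + 1]
        sep = (left.tail or "").strip()
        lx, rx = left.find("XMath"), right.find("XMath")
        if (left.tag == "Math" and right.tag == "Math" and sep in RELATIONS
                and lx is not None and rx is not None):
            # The role value is an assumption made for this sketch.
            op = ET.SubElement(lx, "XMTok", {"role": "RELOP"})
            op.text = sep
            lx.extend(list(rx))
            left.set("tex", left.get("tex", "") + sep + right.get("tex", ""))
            left.tail = right.tail
            parent.remove(right)
            children.pop(i + 1)  # stay on the merged node to allow chains a=b=c
        else:
            i += 1
    return parent

td = ET.fromstring(
    '<td><Math tex="P"><XMath><XMTok>P</XMTok></XMath></Math>\n'
    '=<Math tex="N"><XMath><XMTok>N</XMTok></XMath></Math></td>'
)
merge_across_relation(td)
print(ET.tostring(td, encoding="unicode"))
```

The `xml:id` of the second `<Math>` is simply discarded here; a real implementation would have to decide what happens to references pointing at it.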
    

Changed 11 months ago by deyan

Two further items:

  • We might enhance the LaTeXML Ligatures and Rewrite APIs to allow rewriting this module entirely via the LaTeXML API.
  • When moving text into math (or vice-versa) make sure the font properties are adequately preserved (or repaired).

Changed 10 months ago by deyan

Another problem is the abuse of \bullet for creating bullet points instead of doing real math.

We need a heuristic that spots math at the beginning of a line, checks if it starts with bullet followed by space, and creates an {itemize} with an \item instead.

A dummy example:

Statement, where:\\

$\bullet\quad 1+2=3$
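
The bullet heuristic could be sketched like this, in Python on TeX source lines for illustration. The set of spacing macros accepted after `\bullet` and the function name are assumptions of the sketch.

```python
import re

# Math at the start of a line that opens with \bullet followed by some spacing.
BULLET = re.compile(r"^\$\\bullet(?:\\quad|\\qquad|\\,|\\ |\s)+([^$\s].*)$")

def bullets_to_itemize(lines):
    """Turn runs of '$\\bullet ...$' lines into an itemize environment."""
    out, in_list = [], False
    for line in lines:
        match = BULLET.match(line)
        if match:
            if not in_list:
                out.append(r"\begin{itemize}")
                in_list = True
            # The rest of the formula stays in math mode, minus the bullet.
            out.append(r"\item $" + match.group(1))
        else:
            if in_list:
                out.append(r"\end{itemize}")
                in_list = False
            out.append(line)
    if in_list:
        out.append(r"\end{itemize}")
    return out

for line in bullets_to_itemize([r"Statement, where:\\", r"$\bullet\quad 1+2=3$"]):
    print(line)
```

Consecutive bullet lines end up in the same `itemize`; any other line, including a blank one, closes the list in this simplified version.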

Changed 10 months ago by deyan

And another one:

  • When a math expression starts or ends with an unbalanced fence, check if it is preceded by the closing/opening partner, and add it to the expression.

Example:

($a\in S)$

The last two issues are from real documents in the trainset of my thesis (doc 1 and doc 2 respectively). It is really shocking how sloppy people are on the arXiv.
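
A sketch of the unbalanced-fence item above, in Python on the TeX source for illustration. It only looks at round brackets directly adjacent to the math delimiters and ignores `\left(`, `\right)` and escaped brackets; the function name is hypothetical.

```python
import re

# An optional "(" right before and an optional ")" right after a math block.
MATH = re.compile(r"(\(?)\$([^$]+)\$(\)?)")

def absorb_fence_partner(text):
    """Pull an adjacent bracket into a math block that contains its partner."""
    def fix(match):
        before, body, after = match.groups()
        depth = low = 0
        for ch in body:
            depth += (ch == "(") - (ch == ")")
            low = min(low, depth)
        # low < 0: a ")" inside the math closes a bracket opened outside it.
        if low < 0 and before:
            body, before = before + body, ""
        # depth - low > 0: a "(" inside the math is still open at its end.
        if depth - low > 0 and after:
            body, after = body + after, ""
        return before + "$" + body + "$" + after
    return MATH.sub(fix, text)

print(absorb_fence_partner(r"($a\in S)$"))
```

Balanced cases such as `($x$)` are left untouched, since neither condition fires.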

Changed 8 months ago by deyan

In order to have a maintainable component, we have to structure and systematise the various kinds of purifications we are performing and their respective causes.

A first step towards such a systematic view is given in Van der Hoeven's paper "Towards semantic mathematical editing", section 3.2, with specific ideas for correction procedures in section 3.3. The paper covers a lot of relevant ground for purification and the structure of math formulas in general.

Changed 2 months ago by deyan

Another example of math mode abuse from ZBL leads to unbalanced fences:

$c_2(c_1,c_2$ are positive constants), 

It might be worthwhile to make a heuristic that checks whether fences are balanced in text mode and, if unbalanced ones are found, examines the math in the current context for a matching unbalanced symbol.
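
The detection half of such a heuristic might be sketched as below, in Python for illustration. It treats one paragraph as "the current context", handles round brackets only, and reports suspicious math blocks without repairing them; both function names are hypothetical.

```python
import re

def fence_imbalance(s):
    """Return (unmatched closers, unmatched openers) for round brackets in s."""
    depth = low = 0
    for ch in s:
        depth += (ch == "(") - (ch == ")")
        low = min(low, depth)
    return -low, depth - low

def suspicious_math(paragraph):
    """List math blocks whose unbalanced bracket has its partner in text mode."""
    # With a capture group, re.split alternates text segments and math bodies.
    parts = re.split(r"\$([^$]*)\$", paragraph)
    text_close, text_open = fence_imbalance(" ".join(parts[0::2]))
    hits = []
    for body in parts[1::2]:
        math_close, math_open = fence_imbalance(body)
        if (text_close and math_open) or (text_open and math_close):
            hits.append(body)
    return hits

print(suspicious_math(r"$c_2(c_1,c_2$ are positive constants),"))
```

For the ZBL example this flags the single math block, whose open bracket is closed only in the surrounding text.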
