Building the “Alexa, write my quiz” Skill Part 2: Review & Application of Machine Learning Technologies for Glossary Text Extraction

Mary Gwozdz,
Software Engineer
 

Machine learning allows us to consume large quantities of data in order to bring new life to expansive corpora in a variety of industries, such as education. The sheer quantity of digitizable educational materials, including textbooks, lecture notes, class notes, and academic publications, is astounding, not to mention all of the hobbyists and creators who have started making educational content specifically for the Internet via blog posts, videos, and games. With the growing plethora of information available at the fingertips of artificial intelligence (AI), it feels like only a matter of time before AI can generate highly coherent materials to present to a learner on a topic, providing invaluable assistance to teachers.

As a stepping stone to the future I described, I have embarked on a journey to see if I can use machine learning to extract term/definition pairs from pdf glossary files to algorithmically create vocabulary quizzes without human intervention. This journey was inspired by the work I wrote about in my previous article, where I created an Alexa skill that generated vocabulary quizzes from pdf files by writing code that used Apache PDFBox to extract the term/definition pairs. While that worked well, it would require numerous tweaks to be usable with a large variety of text layouts. I believed that switching to machine learning (as shown in Figure 1) would help solve this problem, enabling me to generate quizzes from a larger set of documents without having to continuously write updates to the code. If quiz data could be quickly organized from pdf files, then the potential amount of quiz material available to Alexa would also greatly increase.

 

A2-Figure1

Figure 1: The flow of my Alexa skill consists of the user informing Alexa of the exam source, Canvas file name, subject, quiz name, and number of questions. The alexa-stream-handler then triggers the write-quiz-lambda to pull any relevant pdf documents from Canvas, convert them into csv files of quiz questions, and send those back to the user. Previously, this conversion was completed with custom Java code using Apache PDFBox, but I am going to swap out that component for a machine learning algorithm.

 

More specifically, I wanted to construct a pathway for sending pdf glossary files (such as the OpenStax Biology textbook glossary shown in Figure 2) through a machine learning algorithm to convert them into sets of term/definition pairs.

 

A2-Figure2

Figure 2: A sample of the pdf glossary used from an OpenStax Biology textbook.


Ideally, I hoped to find an algorithm that had already been trained to discern name-value pairs, so that it could be dropped into the spot of the black box in Figure 1 without too much extra work. In the absence of such an algorithm, I hoped to at least send the pdf file through a machine learning algorithm (MLA) that had already been trained to output the text inside the file, and then develop my own algorithm to distinguish term/definition pairs using whatever additional information the existing MLA could provide about the text. The tools I considered were Amazon Textract and Tesseract, as MLAs that have already been trained to output the text in pdf files along with additional information, and Amazon SageMaker A2I Human Workflow and Amazon Comprehend, to help me develop an algorithm to distinguish term/definition pairs from the MLA output.

The Tools

Textract

Amazon Textract is a machine learning service that extracts text from image and pdf files. In addition, when you send a document to the Textract client for analysis, it returns a set of json blocks containing positional information about the text, as shown in Figure 3a. Each json block has parameters for BlockType, Confidence, Geometry, Id, Text, and, if present, Relationships, which are explained in Table 1. With a bit of additional coding, the coordinates listed in the Polygon section of the output can be used to draw literal boxes representing each BoundingBox, as shown in Figure 3b.
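As a minimal sketch of what that call can look like (my own illustration, assuming boto3 with AWS credentials configured; the image file name is a placeholder for a png rendering of one glossary page):

import boto3

# Placeholder file name for a png rendering of one glossary page.
PAGE_IMAGE = 'openstax-glossary-page.png'

textract = boto3.client('textract')

with open(PAGE_IMAGE, 'rb') as page:
    response = textract.detect_document_text(Document={'Bytes': page.read()})

# Each block carries the fields described in Table 1 below.
for block in response['Blocks']:
    if block['BlockType'] == 'LINE':
        box = block['Geometry']['BoundingBox']
        print(block['Text'], round(block['Confidence'], 1),
              round(box['Left'], 3), round(box['Top'], 3))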

 

A2-Figure3a

Figure 3a: An example of a LINE json block returned from Textract


 

A2-Figure3b

Figure 3b: A visualization of the geometric data provided by Amazon Textract. The corners of each of these boxes correspond to a set of coordinates listed in the Polygon section of the json output from Textract. The LINE blocks are outlined in black, while each WORD block is marked in green at the beginning of the word and in red at the end of the word.
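One way to produce a visualization like this (a sketch of my own, continuing from the earlier response object and assuming Pillow is installed; the file names are placeholders) is to scale the normalized Polygon coordinates by the image dimensions and draw them:

from PIL import Image, ImageDraw

# Textract geometry is normalized to the page, so scale by the image size.
image = Image.open('openstax-glossary-page.png')
draw = ImageDraw.Draw(image)
width, height = image.size

for block in response['Blocks']:
    if block['BlockType'] == 'LINE':
        corners = [(point['X'] * width, point['Y'] * height)
                   for point in block['Geometry']['Polygon']]
        draw.polygon(corners, outline='black')

image.save('openstax-glossary-page-boxes.png')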
 

Json Parameter | Description
BlockType | Possible values include LINE, WORD, and PAGE
Confidence | Indicates probability that the text was correctly discerned
Geometry | Contains a BoundingBox and a Polygon. The BoundingBox contains the Height, Left, Top, and Width of the block of text detected, and the Polygon contains coordinates for the corners of a polygon that contains the block of text detected.
Id | Contains the unique identifier for this block
Text | Contains the text that Textract detected in this block
Relationships | Lists the Ids of all of the WORD blocks within a LINE block or of all the LINE blocks within a page

Table 1: Descriptions of Textract json output parameters


Amazon SageMaker A2I Human Workflow

Amazon Textract is also trained to read key-value pairs from form data, meaning that if it is provided with a set of questions or prompts from a form, it can collect the answers to those questions.

To train the Textract machine learning algorithm to understand your form, you can provide it with a set of keys and test data and then utilize Amazon’s SageMaker A2I Human Workflow to check and correct its work. The Human Workflow is a process in which a team of people, called a Human Review Workforce, indicates whether or not Textract correctly detected the value for each key and, if not, what the value should have been. The Human Review Workforce can be a team of your own people, or Amazon can provide you with a team of people known as an Amazon Mechanical Turk Team.

Figure 4 shows an example of the workflow for checking Textract key-value pairs. The column on the right lists the keys from the key-value pairs, and the Workforce team member can specify the correct value in the blank space underneath each key. In the example in Figure 4, Textract was not able to fill in the blank spaces with the correct values. I suspect that this is primarily because Textract does not automatically divide the text into columns before making interpretations, and secondarily because the key-value pair recognition is designed for the context of forms and not glossaries.

 

A2-Figure4

Figure 4: Example of the SageMaker A2I Workflow for checking key-value pairs identified by Textract. Notably, Textract was not able to locate the values for any of the given keys in this example.


Amazon Comprehend

Amazon Comprehend is a service that can ingest text and output an indication of which words in the text correspond to certain entities, key phrases, or parts of speech. It also outputs the detected language and the positive, negative, or neutral sentiment of the text. Amazon Comprehend’s built-in entity recognition covers branded commercial items, dates, events, locations, organizations, people, quantities, and titles. However, Amazon Comprehend also has a feature for custom entity recognition, where you can train it to recognize additional entity types.
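As a quick sketch of what the built-in entity recognition looks like in code (my own example, assuming boto3 and AWS credentials; the sample sentence is arbitrary):

import boto3

comprehend = boto3.client('comprehend')

# Detect the built-in entity types in a short sample sentence.
sample = 'OpenStax publishes free textbooks from Rice University in Houston, Texas.'
result = comprehend.detect_entities(Text=sample, LanguageCode='en')

for entity in result['Entities']:
    print(entity['Type'], entity['Text'], round(entity['Score'], 2))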

Custom entity recognition can take place in two different ways. One way, known as the Annotation method, is to submit csv files that indicate a given text file, the new entity type, the line number on which the entity is located in that file, and the start and stop character offsets of the entity within that line, as shown in Figure 5.

 

A2-Figure5

Figure 5: Example of an annotations csv file for training Amazon Comprehend to recognize the custom entity type TERM.
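For reference, an annotations file of this shape looks roughly like the following (the file name, line numbers, and offsets here are illustrative placeholders of my own, not values from the actual experiment):

File,Line,Begin Offset,End Offset,Type
openstax-biology-ch1-glossary.txt,0,0,7,TERM
openstax-biology-ch1-glossary.txt,1,0,11,TERM
openstax-biology-ch1-glossary.txt,2,0,9,TERM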


These csv files then go through Amazon Comprehend’s machine learning training process, which creates a model that can read the provided text files and predict the location of the new entity type, and then tests this model for the accuracy of its predictions.

The second way to train Amazon Comprehend for custom entity recognition is the Entity List method. The Entity List method consists of submitting csv files containing a list of known values and their corresponding entity type to Amazon Comprehend’s machine learning training process. Interestingly, Amazon has already trained Comprehend on a large set of additional medical entities, offering the result as Amazon Comprehend Medical, which recognizes most of the entities found in medical charts or records. This ability to recognize medical entities in context with a high level of accuracy shows great promise for applying the technology to additional fields.

Tesseract

Tesseract is Google’s open-source optical character recognition (OCR) engine, putting it in direct competition with Amazon Textract, as it also takes images as input and can output the text it reads from them. The tesserocr Python package provides PyTessBaseAPI, which can be utilized to detect the text and extract individual words, along with their confidences and font attributes such as whether the text is bold, italicized, or underlined, and the font size. Figure 6 below shows the python code for printing out the information that Tesseract provides on individual words.

from tesserocr import PyTessBaseAPI, RIL, iterate_level

images = ['openstaxbiology-ch1-1.png']

with PyTessBaseAPI() as api:
    for img in images:
        api.SetImageFile(img)
        # print(api.GetUTF8Text())
        # print(api.AllWordConfidences())
        api.Recognize()
        # Walk the recognition results one WORD at a time.
        ri = api.GetIterator()
        level = RIL.WORD
        counter = 0
        for e in iterate_level(ri, level):
            counter = counter + 1
            word = e.GetUTF8Text(level)
            # Print the first two dozen words along with their font attributes.
            if counter < 25:
                print(word)
                print(e.WordFontAttributes())
                print()

Figure 6: Example of retrieving information about detected words with Tesseract.

To conclude this review, what I learned about these four technologies is summarized in Table 2 below.

 

Tool | Description
Amazon Textract | Identifies individual words and their location on the page
Amazon SageMaker A2I Human Workflow | Will train Textract to find name-value pairs when the list of names is given
Amazon Comprehend | A machine learning algorithm trained to read prose or form data. It recognizes a limited set of entities but can be trained with Custom Entity Recognition, which takes in a .txt file and requires a large csv of known data.
Tesseract | Google’s OCR engine, and the only OCR tool reviewed that includes a library for reviewing font attributes, if they are available

Table 2: Summary of research done on machine learning tools.

Application

With knowledge of these four technologies, I started to construct a pathway for transforming my pdf glossary files into sets of term/definition pairs using machine learning. Upon converting the pdf file to a png image and sending it through Textract, I noticed that while it correctly translated the image into the text that was on the page, this text was not in the right order. Textract had not realized that this document consisted of two columns, so it ran the lines of text together across the page. Curious as to why this was and how to mitigate the issue, I adapted this method for getting Textract to read a newspaper, producing the result shown in Figure 7. This method consists of selecting a target line and then looping through all of the remaining lines to find the ones that are in the same column as the target line. After making some adaptations to their approach, I had three metrics that needed to be met in order for a given line in the loop to be selected as belonging to the column of the target line. These metrics are as follows, with similarity determined by taking the difference between the two values and comparing it to a hard-coded tolerance (a sketch of the resulting check follows the list):

  1. The given line's left-most x-coordinate was similar to that of the target line
  2. The given line's height was similar to the height of the target line
  3. The distance between the bottom y value of the given line and that of the line above it was below a hard-coded tolerance
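A minimal sketch of that check (my own reconstruction from the description above, operating on Textract LINE blocks; the tolerance values are placeholders rather than the ones I actually used):

# Placeholder tolerances, expressed as fractions of the page size.
LEFT_TOLERANCE = 0.02
HEIGHT_TOLERANCE = 0.005
VERTICAL_GAP_TOLERANCE = 0.03

def belongs_to_column(target, candidate, line_above):
    # Pull the normalized bounding boxes for each LINE block.
    target_box = target['Geometry']['BoundingBox']
    cand_box = candidate['Geometry']['BoundingBox']
    above_box = line_above['Geometry']['BoundingBox']

    # Metric 1: the left edges line up.
    same_left = abs(cand_box['Left'] - target_box['Left']) < LEFT_TOLERANCE
    # Metric 2: the line heights are similar.
    same_height = abs(cand_box['Height'] - target_box['Height']) < HEIGHT_TOLERANCE
    # Metric 3: the line sits just below the previous line in the column.
    cand_bottom = cand_box['Top'] + cand_box['Height']
    above_bottom = above_box['Top'] + above_box['Height']
    close_below = abs(cand_bottom - above_bottom) < VERTICAL_GAP_TOLERANCE

    return same_left and same_height and close_below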

 

A2-Figure7

Figure 7: A visualization of dividing Textract results appropriately into columns.


At this point, I could accurately identify the separate columns or regions of the text and put them in the correct order. Next, I needed to determine how to retrieve the term/definition pairs. As described above, the Textract key-value pair recognition in combination with the SageMaker A2I Human Workflow did not work for retrieving term/definition pairs. To me, as a human reading this document, the most obvious indicator of these pairs is the bold text for each term. Hoping to use Tesseract's ability to retrieve font attributes to identify the words with bold text, I sent my glossary through the code shown in Figure 6. Unfortunately, it returned None for the WordFontAttributes of every word in the file, which appears to be due to a bug in Tesseract dating back to 2017.

Next, I attempted to train Amazon Comprehend to recognize the terms in my ordered Textract output utilizing the Annotation method as shown in Figure 5. In order to create those annotation files, I wrote an algorithm that used Textract's geometric data to calculate the amount of space between each of the words. As shown in Figure 3b, the amount of space between each term and its corresponding definition is considerably greater than the amount of space between the remainder of the words in the document. Therefore, any word with a gap to the following word larger than a certain tolerance was deemed to be a term. Once the terms were identified, I just needed to count the number of characters in each term and identify the line number it resided on so that I could create the annotation file. After I created several annotation files corresponding to various glossaries from OpenStax textbooks, I sent the data through Amazon Comprehend. The results indicated that Amazon Comprehend could recognize the TERM entity type with a precision of 88%, a recall of 94%, and an F1 score of 91%.
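A rough sketch of that gap check (my own reconstruction; it assumes the WORD blocks for one glossary line are already in left-to-right order, and the tolerance is a placeholder):

# Placeholder horizontal gap tolerance, as a fraction of the page width.
GAP_TOLERANCE = 0.03

def split_term_and_definition(words):
    # Find the first unusually large gap between consecutive WORD blocks.
    for i in range(len(words) - 1):
        current = words[i]['Geometry']['BoundingBox']
        following = words[i + 1]['Geometry']['BoundingBox']
        gap = following['Left'] - (current['Left'] + current['Width'])
        if gap > GAP_TOLERANCE:
            term = ' '.join(w['Text'] for w in words[:i + 1])
            definition = ' '.join(w['Text'] for w in words[i + 1:])
            return term, definition
    # No large gap found, so this line does not start a new term.
    return None, ' '.join(w['Text'] for w in words)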

At this point, I noticed three issues with my approach:

  1. I was only extracting this data one page at a time, so if a definition ever continued onto the next page, that information would be lost.
  2. I only had annotations for recognizing the TERM entity type, so I wasn’t training it to be able to recognize definitions. However, I hoped that if it could recognize the terms well, then it would be simple to write an algorithm to just gather up the text between terms as the definitions.
  3. The F1 score, which is the harmonic mean of the precision and recall, was 91%. It seemed like this wouldn’t be good enough, since it meant that Amazon Comprehend would be making an incorrect prediction as to what the terms were about one out of every ten times.

Consequently, I began mitigating the first problem by reading all of the glossary pages for each chapter together at once. To do this, I added code to remove the headers and footers from the pages and connected their ordered text into one output text file that the annotation file would reference. Once I had recreated the annotation files after addressing the nuances that arose from this change of strategy, I sent my data back through Amazon Comprehend. Unfortunately, I found that the results had gotten much worse, with a precision of 35%, a recall of 63%, and an F1 score of 45%.

At this point, I began to rethink my approach altogether. I wondered why I was trying to get Amazon Comprehend to identify the terms in the file when the code I wrote to create its training annotations already identified the terms in the first place. It occurred to me that instead of using Amazon Comprehend, I could simply improve my data science approach to identifying the terms based on the geometric data from the Textract output and likely get better results faster.

Therefore, I tried to step back and take a holistic view of how I was calculating the order of the text, the removal of extraneous text, and the location of the terms themselves. I continued using the first metric from the newspaper column detection method: determining whether or not the left coordinate of each line of text was close enough in value to the left coordinate of each line in its column. I also kept the third metric: determining whether the bottom y values of each line were sufficiently distanced from each other to indicate that one line belonged directly below another. However, instead of the second metric of eliminating titles from the text based on each line's height, I switched to eliminating them based on each line's area per character. Because line height was calculated from the top coordinate of the highest letter to the bottom coordinate of the lowest letter in each line, certain lines containing letters like "b" and "g" would get flagged as titles, while lines containing only short letters, such as "wax", would get flagged as headers or footers. To prevent this, I calculated the area of each line and then divided it by the number of characters in the line to get a better estimate of the line's font size.
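In terms of the Textract BoundingBox fields, that metric is simply the following (a tiny sketch of my own):

def area_per_character(line_block):
    # Approximate the font size of a LINE block by its area per character.
    box = line_block['Geometry']['BoundingBox']
    return (box['Width'] * box['Height']) / max(len(line_block['Text']), 1)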

Next, I did further research to improve my statistical approach to defining some of the threshold values for these metrics. Initially, following the approach of the newspaper column detection method, I simply hard-coded my tolerance values and adjusted them as needed. However, across three different textbooks, certain pages had nuances that made them different, such as a word printed further off to the side than expected, so I needed to set my tolerance values based on the extent to which each case deviated from what was expected.

In addition to my initial hard-coded tolerance values, I also tried out the outlier formula and the z score method, which ultimately worked best for my needs. The outlier formula defines the upper tolerance as the third quartile plus 1.5 multiplied by the interquartile range between the first and third quartiles, as shown in Figure 8a. In the z score method, I defined z as the number of standard deviations away from the mean that I wanted the cutoff to be; the tolerance then equaled the mean plus the standard deviation multiplied by z, as shown in Figure 8b. My results from these experiments are illustrated in Table 3, with the best statistical approach being to use the outlier formula for calculating the vertical distance tolerance and the z score formula for calculating the area per character tolerance.

IQR = Q3 - Q1
tolerance = Q3 + 1.5(IQR)
Figure 8a: The outlier formula for upper tolerance.

tolerance = μ + zσ
Figure 8b: The z score formula for tolerance.
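A small sketch of both calculations with NumPy (my own illustration; the choice of z is a placeholder):

import numpy as np

def outlier_tolerance(values):
    # Upper tolerance from the outlier formula: Q3 + 1.5 * IQR.
    q1, q3 = np.percentile(values, [25, 75])
    return q3 + 1.5 * (q3 - q1)

def z_score_tolerance(values, z=2.0):
    # Tolerance from the z score formula: mean + z standard deviations.
    return np.mean(values) + z * np.std(values)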

 

 

Vertical Distance Tolerance | Area Per Character Tolerance | Incorrect Pairs/Chapter (Biology) | Incorrect Pairs/Chapter (Anatomy & Physiology) | Incorrect Pairs/Chapter (Psychology) | Incorrect Pairs/Chapter (Overall) | Correct Pairs Overall
Hard coding | Outlier formula | 0.40 | 2.20 | 0.00 | 0.87 | 98.94%
Hard coding | Z score formula | 0.40 | 0.40 | 0.00 | 0.27 | 99.55%
Outlier formula | Outlier formula | 0.00 | 2.20 | 0.00 | 0.73 | 99.22%
Outlier formula | Z score formula | 0.00 | 0.40 | 0.00 | 0.13 | 99.83%
Z score formula | Outlier formula | 15.40 | 61.00 | 45.80 | 40.73 | 41.81%
Z score formula | Z score formula | 15.60 | 61.00 | 45.80 | 40.80 | 41.68%

Table 3: This table shows the average number of incorrect term/definition pairs across the five OpenStax end-of-chapter glossaries that I tested from each of three different textbooks. The average number of term/definition pairs for the Biology, Anatomy & Physiology, and Psychology textbooks is 53, 83, and 63, respectively. From these calculations, I concluded that using the outlier formula to calculate the vertical distance tolerance while using the z score formula to calculate the area per character tolerance resulted in the highest accuracy of 99.83%.

 

With all of the adjustments in place, I switched from creating annotation files out of my identified terms to creating term/definition pairs directly. Across the three OpenStax textbooks from which I pulled five glossaries each, I could now identify term/definition pairs with 99.83% accuracy, which I felt was sufficient for updating my Alexa skill. It should now be able to use a much larger set of glossary files to create vocabulary quizzes than the version that used Apache PDFBox could.

Conclusion

Machine learning has massive potential to change the world, and the education industry has such an abundance of data that their combined potential is even greater. While I ultimately only used Amazon Textract for the needs of my experiment, each of the machine learning technologies that I discussed also has phenomenal potential to impact the world, and education specifically. For example, the Textract key-value pair detection with the SageMaker A2I Human Workflow could potentially make a great homework grader. Amazon Comprehend could be trained to read textbooks themselves or administrative forms. With further time and research, I suspect machine learning could still be used to extract term/definition pairs from glossaries as well. In the meantime, my data science approach to the problem is working great.


Mary Gwozdz

Software Engineer

Mary Gwozdz is a Software Engineer at Unicon, joining first as an intern in 2016 before becoming a full-time developer in 2018.

While at Unicon, Ms. Gwozdz has contributed numerous updates to the California Community Colleges (CCC) Application websites in efforts to modernize and improve its content and efficiency. She created Vagrant scripts to replace the environment setup process, decreasing the setup time from 1 month to 1 day. She also used Groovy and AWS to reconcile and improve the movement of encrypted data between its AWS architecture and other applications. Additionally, she wrote multiple Spring Boot REST web services for handling application data and deployed them as Docker containers to Rancher with Jenkins.

Before joining Unicon, Ms. Gwozdz did biomedical engineering research at the University of Texas. Her research resulted in two publications, one for her work in detecting genetic anomalies on a molecular level, and one for her development of a machine learning algorithm for detecting swallows in dysphagia patients.

Ms. Gwozdz graduated magna cum laude from the University of Texas at Austin with a BS in biomedical engineering with a technical emphasis in software engineering.
