Reinforcement Learning in the Classroom
Background and Overview
In my previous article, AWS Lex Chatbot in the Classroom, I began exploring how to incorporate machine learning (ML) in the classroom. The myriad technologies and frameworks provided by Amazon Web Services (AWS) made it fairly simple to wire together processes and allowed me to focus on using the truly unique and valuable aspects of the chatbot to create a digital assistant that could provide definitions and a list of related terms in response to a student question. During this proof of concept (POC) project, I looked for opportunities to include other technologies and settled upon Natural Language Processing (NLP) provided by AWS Lex and Latent Dirichlet Allocation (LDA) for topic maps of related words. This seemed to be a good start. As a next step, I wanted to explore how to make the chatbot more of a learning assistant by enabling it to make personalized recommendations, and that led me to Reinforcement Learning (RL). This article will focus on how I integrated an RL algorithm into the Lex Chatbot process to drive a truly custom student experience.
Customizing the Student Experience
With the chatbot in the classroom, we have the opportunity to enhance the way the classroom works. There are many examples in academic settings and commercial offerings of providing technology-driven guidance or "personalized learning." If we can offload some of the personalization workload from teachers, they can devote even more time to understanding overall class progress and providing highly specialized interventions and individualized instruction. With the chatbot, we provide an experience that the student is probably familiar with from other contexts. Our goal then is to add capabilities to the chatbot. Providing a custom user experience becomes possible if we can use technology to identify what needs to be customized for each student.
The challenge then is to identify how each student is unique. One obvious idea is to identify students who have knowledge gaps. If a student doesn't understand material that the teacher is presenting, either the student gets left behind or the instructor slows the other students while providing specific instruction. If we can identify that a student has knowledge gaps through assessments, we can provide needed content directly to that student without impacting their classmates.
A "knowledge gap" is simply something that we expect all students to know or be able to demonstrate, but a specific student does not or cannot. If we expect all students to understand addition, subtraction, multiplication, and division, but one student doesn't understand division, we would consider that a knowledge gap. When we give quizzes and tests, we ideally can see from students' scores if they understand the associated concepts. For this project, the goal is to have the chatbot provide content to students based on their test scores and their demonstrated mastery of the learning objectives. Providing content based on the demonstrated mastery of a subject is where Reinforcement Learning enters the picture.
What is Reinforcement Learning
Reinforcement Learning (RL) is a Machine Learning (ML) approach where actions are taken based on the current state of the environment and the previous results of actions. To put it in context, I'll provide an example.
Let's say that you are playing a game of Tic-Tac-Toe.
Figure 1. Winning Tic-Tac-Toe game.
The objective of the game is to get three of your symbols in a row. Starting with an empty board, you place one of your symbols, and then your opponent places one of their symbols. You alternate placing symbols until either somebody wins or all spaces are filled. The game provides concrete examples of the important RL terms.
- State - what boxes have what symbols
- Environment - a collection of all of the possible states, not just the current state.
- Action - placing a symbol in a specific box
- State Change - how the environment moves from one state to another, as a result of an action
- Agent - the person (or program) who chooses what action to take
- Reward - whether the action you take leads to a win, loss, or draw.
The following diagram shows the relationship between these terms. In the Tic-Tac-Toe example, t = "turn in the game."
Figure 2. Reinforcement Learning Model. (KDNuggets, 2018).
Given an environment, the agent observes the current state. The agent chooses an action based on the expected reward and sends that action to the environment, which updates the state and returns a reward. This repeats until the game is done.
To walk you through an example, at the beginning of the game (i.e. the starting state), the player (agent) places an X (action) in the center box, which causes a state change. We look at the board and see that nobody has won yet (we are not in a winning state), so the next player places an O in the top right corner. Again the state changes, and we're still not in a winning state. We keep repeating until somebody wins or the board is filled.
Why do people play this game? Nobody ever wins. We can say this because experienced players have learned where to put their symbol, either to move towards a winning state or to block their opponent. After playing a number of games, we've learned enough to prevent the opponent from ever winning if we make certain moves.
As we play the game, we are keeping track of the state of the board and what actions lead to a winning state. It is very obvious when we are one move away from winning; if we pick one of the available squares that wins the game, we play that square. It is also fairly obvious when our opponent is only one move away from winning; if we don't pick a square that blocks a winning move, we lose. We can continue moving backwards through possible moves in this way, picking squares that get us to one move from winning or picking a square that prevents our opponent from getting one move from winning.
The number of possible plays is limited. After 8 moves, there is only one option. After 7 moves, there are only 2 options, etc. For each state, we record which moves are possible and which moves have the shortest path to winning or losing. When we win, we look at each move we've made during the game and increase the probability that we'll make that move again. When we lose, we look at each move we've made and decrease the probability that we'll make that move again. As more and more games are played, the probabilities for any given state will reflect the best possible play.
That learning is at the heart of RL. We examine the state of the board, we know which moves will move us towards the winning state or block the opponent from reaching a winning state, and we make that play.
RL Terminology in the Classroom
With Tic-Tac-Toe, the environment is very simple with a limited number of total states, actions, and agents. The reward is well defined. Translating this approach to the classroom is the purpose of this article.
In the classroom, I define the above terms as follows:
- State - how much of the given subject the student has mastered
- Environment - the learning objectives for a given unit of learning such as a topic in Algebra, History, or Biology
- Action - reviewing classroom content, such as reading the textbook, listening to a lecture, working on a class project, playing an educational game, etc.
- State Change - taking a quiz, test, or other assessment to see how much the student learned since the last assessment
- Agent - the student (or program) that chooses which action to take
- Reward - Final grade for the class, subject, topic, or other assessment
Representing state then becomes a question of assessing what the student knows. Historically, we only have two states: mastered or not mastered. We represent this as Passed or Failed for the course, with the student having the option of retaking the subject. Since the only state transition occurs at the very end of the class, we have little opportunity to identify gaps in the student's knowledge or to do anything about them. A student often takes quizzes and tests during the course of the term, but these assessments never cause a state transition; that only occurs at the end of the term. We want to find a way to represent the state in a more granular fashion, so that we can present content appropriate to the student's current understanding of the material.
I chose to represent a subject as a collection of topics. This could be considered roughly equivalent to chapters in a book, but it could be other types of learning units or learning paths as well. At the end of a topic, we would give the student a test. If the student passed the test, we would consider the topic mastered and move on to another topic. If not, we might either repeat the content presented or provide some additional assistance until the student masters the topic. In this way, each student would progress at their own pace.
When students consume content, such as reading the textbook or playing a game, they are working towards mastery of a specific topic. Once they've finished, we need to assess them to see if they have indeed learned the material. This combination of content and assessment is what drives the RL process. In the chatbot project, I use an automated agent to pick the content that the student will consume. The value here is that through the RL algorithm, the agent can pick the best content, based on the results of every student that has come before. The agent will know which content is more likely to move the student towards mastery because it can look at every student who has seen that content while in that topic state and see how they progressed. If after consuming a specific piece of content, the student scores poorly on the assessment, the agent will recommend that content less frequently to future students. If a specific piece of content leads directly to mastery for a topic or topics, the agent will recommend it more frequently.
One of the assumptions here is that the student has multiple choices for content. Likely content might include a recorded video lecture and a course textbook or other reading, or interacting with a simulation. Other content will likely be required. If only one source of content is available, obviously the agent will have to return the same recommendation each time. Something to consider is that content may be relevant to multiple topics, or that a single topic may require multiple pieces of content to master. The more content that is available, the better.
The RL Agent picks the best content for each state by calculating the reward for each piece of content and returning the content that provides the best reward. We have to consider two situations here when determining what the reward is for a piece of content. We need to consider future value, and we need to consider unexplored content.
A winning state is a state where we can say that the student has mastered the subject. We might consider mastery of every topic to be a winning state. If the student only has one topic left to master, and if the student consumes a specific piece of content, and that piece of content always leads to mastery of that topic, the decision of what piece of content to consume is very simple. However, most of the time, a single piece of content will not take you from a regular state to a winning state. You may have only mastered 5 of 15 topics. The agent needs to choose a piece of content to present, and it does this by looking ahead through the list of state transitions and calculating the path that has the highest value. From a given state, the agent has multiple paths to a mastered state and multiple pieces of content to present along these various paths. The agent will pick the best content that returns the highest reward. The reward can be affected by likelihood of mastering a topic or topics, amount of time spent consuming the content, and length of the path to a winning state. The algorithm can choose greedily by selecting content that has a larger immediate reward or can choose based on a larger long term reward. The reward calculation I implemented was very simple; I assigned a large reward if the content moved to a winning state, a negative reward if not, and let the greedy algorithm discount based on the number of steps required to reach a winning state.
Not all content, and not all students, are created equal. Some content is poorly written, confusing, or even wrong. This type of content will eventually be excluded by the learning agent because the results will show only small rewards from using it. Students learn differently. Some may be faster learners or have better learning strategies. Some come into a class with some knowledge of a subject that others do not have. Therefore, assessments of a student who consumes content can vary significantly. Over time, we expect that the algorithm will be able to infer what is the best content. This variability, however, may lead to some early successes that hide even better paths through the course. We need to introduce some variability into the selection of course content to ensure that we are not missing good content just because some lower performing students were presented with it early in the model building process. We refer to this tradeoff as "explore versus exploit." We may know that we have a good solution in hand, but we cannot know if we have the best possible solution without occasionally looking at content that has been assigned a lower reward. The QLearning algorithm, which we'll discuss below, can be configured to balance exploration against exploitation; how often it explores is one of its hyperparameters. A related but separate hyperparameter is alpha, the learning rate, which can be any value between 0 and 1 and controls how strongly new results override what the model has already learned. At one end, if alpha = 0, the algorithm ignores any new learning when setting the state/action value; it relies entirely on history. At the other end, if alpha = 1, the algorithm relies entirely on the newly calculated value, ignoring what has happened before.
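To make these settings concrete, here is a minimal sketch of the textbook tabular Q-learning update together with epsilon-greedy selection. It is written independently of the library used in this project, so the class name, the `gamma` discount factor, and the `epsilon` exploration rate are assumptions for illustration; the library's own names and defaults may differ.

```java
import java.util.Random;

/** Illustrative tabular Q-learning sketch; not the library used in this project. */
final class QTableSketch {
    private final double[][] q;    // q[state][content] = learned value
    private final double alpha;    // learning rate, 0..1
    private final double gamma;    // discount applied to future value, 0..1 (assumed)
    private final double epsilon;  // probability of exploring, 0..1 (assumed)
    private final Random random;

    QTableSketch(int states, int contents, double alpha, double gamma,
                 double epsilon, long seed) {
        this.q = new double[states][contents];
        this.alpha = alpha;
        this.gamma = gamma;
        this.epsilon = epsilon;
        this.random = new Random(seed);
    }

    /** Highest learned value over all content at the given state. */
    double bestValue(int state) {
        double best = q[state][0];
        for (double v : q[state]) {
            best = Math.max(best, v);
        }
        return best;
    }

    /** Standard update: blend the old value with the newly observed target. */
    void update(int state, int content, int nextState, double reward) {
        double target = reward + gamma * bestValue(nextState);
        q[state][content] = (1 - alpha) * q[state][content] + alpha * target;
    }

    /** Epsilon-greedy: usually exploit the best content, sometimes explore. */
    int recommend(int state) {
        if (random.nextDouble() < epsilon) {
            return random.nextInt(q[state].length);
        }
        int best = 0;
        for (int c = 1; c < q[state].length; c++) {
            if (q[state][c] > q[state][best]) {
                best = c;
            }
        }
        return best;
    }

    double value(int state, int content) {
        return q[state][content];
    }
}
```

With `alpha` set to 0 the `update` call leaves the stored value untouched, and with `alpha` set to 1 it replaces the stored value with the new target. Exploration is controlled separately by `epsilon`: at 0 the agent always returns the highest-valued content, and larger values make it try other content more often.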
Leveraging the Chatbot
In our proof of concept (POC) Spring Boot application, we introduced the AWS Lex Chatbot and described how it could leverage Natural Language Processing (NLP) to identify the intent of a student's question and then return a response. We treat the RL content recommendation process as simply another intent. I gave the chatbot an intent of "Recommend." When the student asks the chatbot to "recommend content," the chatbot identifies the intent and then calls a service to return the content to present. As far as the chatbot is concerned, this process is identical to other processes that return static content or content from a database. The service that is called, however, is an implementation of an RL agent.
The QLearning Algorithm
In RL, data scientists have identified multiple machine learning (ML) algorithms that can learn from these state transitions. I searched for one that had an implementation in Java, could be easily integrated, and was well understood. I came upon an implementation of the QLearning algorithm on GitHub that was clean, was configurable, and that I could easily use to build a model. Throughout my exploration of AWS and ML, I have been looking for components that I could easily bolt together rather than implement myself, and this implementation of QLearning met my needs well.
- It had hyperparameters that could be easily configured. A hyperparameter is a configurable parameter that controls behavior within the algorithm. For example, how often the algorithm will choose to explore versus exploit is a hyperparameter.
- It could use a Greedy or Lazy rewards calculation algorithm. Greedy algorithms prioritize the reward for the next actions; Lazy algorithms prioritize the overall reward to reach the winning state.
- It had an easy interface for requesting content by state.
- It had an easy interface for building a model. A model is the summary of what the agent has learned. It holds what states exist, what content exists, what content has been tried at what state, and what the reward has been for that content at that state.
- It had an easy interface for exporting and importing models.
Capturing the Data
For our agent, we needed four pieces of information:
- The student's starting state.
- The content presented.
- The student's ending state.
- The reward for this state transition.
The student's starting and ending states are determined by the student taking an assessment. Assessing competence of students is its own complicated problem, and I will not attempt to explain it here. However, we do depend upon knowing that the student has mastered a topic or not. Every time a student takes an assessment, we need to be able to map the results of that assessment to a state in the course. Let's say, for example, that Student A has mastered Topics 1, 2, and 3 in a Biology course. We present Student A with a lecture that covers the contents of Topic 4. We can then assess the student on the content of Topic 4. If he passes the test, we can say that he has now mastered Topic 4. We can then change the student's state to say that he has mastered Topics 1, 2, 3, and 4. We need to capture that state change from only knowing Topics 1, 2, and 3 to knowing Topics 1, 2, 3, and 4.
In my Spring Boot application, I wrote a mock assessment tool. The tool allows the user to set the state directly for topics in the course. It doesn't actually verify comprehension; it asks no subject questions. For the purposes of capturing state transitions, it met my goal. Eventually, content and assessment tools developed in tandem should replace my mock assessment.
When the student asks the chatbot to "recommend content," the agent service will return the recommended content. We capture that content id as well.
Finally, the reward is calculated, based on an algorithm I wrote. The initial version of the algorithm simply assigns a reward of 100 when transitioning to a winning state and a reward of -1 otherwise. A "winning state" is defined as any state where at least 80% of the topics have been mastered. This algorithm could be much more complicated or flexible, but for this POC it was adequate.
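The reward rule is small enough to sketch directly. The version below is illustrative only: the class and method names are hypothetical, the state is simplified to one boolean per topic, and treating the 80% threshold as "at least 80%" is an assumption.

```java
/** Illustrative reward rule: 100 for reaching a winning state, -1 otherwise. */
final class RewardSketch {
    static final double WIN_REWARD = 100.0;
    static final double STEP_REWARD = -1.0;

    /** Winning means at least 80% of topics are mastered (the "at least" is assumed). */
    static boolean isWinning(boolean[] topicMastered) {
        if (topicMastered.length == 0) {
            return false; // a course with no topics cannot be won
        }
        int mastered = 0;
        for (boolean m : topicMastered) {
            if (m) {
                mastered++;
            }
        }
        // Integer comparison avoids floating-point rounding at the threshold.
        return mastered * 100 >= topicMastered.length * 80;
    }

    /** Reward for a transition, based only on the state the student ends in. */
    static double reward(boolean[] endingState) {
        return isWinning(endingState) ? WIN_REWARD : STEP_REWARD;
    }
}
```

For a five-topic course, ending with four topics mastered yields 100, while ending with three yields -1. The small negative reward on every non-winning step is what makes longer paths to a winning state less attractive than shorter ones.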
Building the Model
We've assumed to this point that the agent running inside a chatbot service already had a functional model, even if the model was based on little or no data. We're going to describe here how that model is actually built.
Building a model requires two types of data: the content ID and the assessment. The assessment tells us what state the student is now in. The content ID tells us what content the student reviewed. The first assessment taken records the starting state. Then the content reviewed is stored. Then the next assessment is taken, and that state is recorded as well. This back and forth is the path that the student walks as they master the subject. The RL agent writes the content ID to a Kinesis Firehose to ensure delivery and performance. The assessment tool writes the new student state to the same Kinesis Firehose. I added a lambda function to the firehose to process the data.
The lambda has three roles:
- When an assessment is taken, create a database record for the student and course that includes the state after it is completed. This will be the starting state for the next assessment.
- When the content is sent, update that database record with the content id.
- When the next assessment is taken, retrieve the database record to get the starting state and the content id. Then, delete the record. Write the starting state, content id, and ending state from the assessment to a CSV file.
This process repeats for each assessment that is taken, with the ending state for each assessment becoming the starting state for the next one.
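The bookkeeping described above can be sketched with an in-memory map standing in for the database and a list standing in for the CSV file. This is not the actual Lambda code; the class, method, and identifier names are hypothetical, and the real function would read its events from the Firehose rather than from direct method calls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for the Lambda's bookkeeping; all names are hypothetical. */
final class TransitionRecorderSketch {
    /** Pending record: the last assessed state plus the content shown since then. */
    private static final class Pending {
        final int startState;
        String contentId; // null until content has been recommended

        Pending(int startState) {
            this.startState = startState;
        }
    }

    private final Map<String, Pending> pendingByStudentCourse = new HashMap<>();
    private final List<String> csvLines = new ArrayList<>();

    private static String key(String studentId, String courseId) {
        return studentId + "|" + courseId;
    }

    /** Closes any pending transition, then starts a new record from this state. */
    void onAssessment(String studentId, String courseId, int assessedState) {
        String k = key(studentId, courseId);
        Pending previous = pendingByStudentCourse.remove(k);
        if (previous != null && previous.contentId != null) {
            // CSV row: starting state, content id, ending state.
            csvLines.add(previous.startState + "," + previous.contentId + "," + assessedState);
        }
        pendingByStudentCourse.put(k, new Pending(assessedState));
    }

    /** Attaches the recommended content to the student's pending record. */
    void onContent(String studentId, String courseId, String contentId) {
        Pending pending = pendingByStudentCourse.get(key(studentId, courseId));
        if (pending != null) {
            pending.contentId = contentId;
        }
    }

    List<String> csvLines() {
        return csvLines;
    }
}
```

Calling `onAssessment`, then `onContent`, then `onAssessment` again for the same student and course produces one CSV row, and the second assessment's state becomes the starting state of the next row. Note that the row carries no student identifier, which matches the CSV described in this article.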
With the data in a CSV file that contains all of the state transitions, I simply need to feed those state transitions into the Agent's model building function one at a time. As more transitions are added, the model gets better at calculating future rewards for content based on state. As this was a proof of concept, I had very little data to use to build the model, but even little data was adequate to get valid recommendations for content based on the student's state. I wrote a standalone program to feed these state transitions into a QLearning model. This is the only manual part of the process.
Once all of the data had been fed into the model, I exported the model as a JSON file to an S3 folder. When the Lex Lambda function restarts, as it will automatically after becoming idle for a while, the Agent service will load the new model automatically from S3.
Thus we have a closed loop for data gathering and model generation for the RL Agent.
This process seems somewhat convoluted, mainly because there was no preliminary design when creating the process. As I discovered the needs, I added pieces to the process. I already have ideas on how this process can be simplified. However, because this process relies on humans to provide data, and because I cannot guarantee if or when the data will be available, I had to create an interim step to store data while the Content/Assess process was in progress.
Encoding the State
The first challenge was how to represent the student's mastery of a subject so that an RL algorithm could process it. After discussions with Marquess Lewis, Unicon's CTO, I decided that treating a subject as a collection of topics would be reasonable, and that each topic would have some level of mastery by the student. I wanted the states to be related to existing course chapters but not necessarily map exactly to them. I wanted the instructor to have some flexibility in how to structure the course, and I've had many courses where chapters are either skipped, presented out of order, or taught with other chapters as a unit. I decided on calling these units "topics" and allowing the instructor to say what a topic contained.
I also wanted the instructor to be able to divide a course into many topics and to allow for different levels of mastery within a topic. This presented a technical problem: how many states are needed to represent a course? Let's say that you have a course that only has one topic, and the topic is either mastered or not. The course for that student can be in one of 2 states. Now let's say that the course has two topics, and each topic can be either mastered or not. The course for that student can now be in one of 4 states. 3 topics lead to 8 states, 4 topics to 16 states, etc. 16 topics would give 65,536 states (2^16 states). In the OpenStax Biology online textbook, the course is divided into 47 chapters, which implies 2^47 states, or roughly 140 trillion. Supporting trillions of possible states is not technically feasible, so we need to have some limits on the number of states. In addition, for my POC, I chose to allow a topic to be in one of 3 states: beginner, advanced, and master. For Biology, that would imply 3^47 states, which is even less manageable. Finally, the implementation of the QLearning algorithm I chose uses a 32-bit integer to store states for its calculations. This is a limit of about 4.29 billion (2^32) states; we can't even get to trillions with this implementation.
I chose to limit the number of states by encoding the state of each topic as a 2-bit binary number, allowing up to 4 values. I only use three. Then I store up to 16 topics in the state by packing each topic's 2-bit value into a single 32-bit integer.
For example, I encode beginner as 00, advanced as 01, and master as 10.
Here are examples of how I encode a student's state:
00000000 00000000 00000000 00000000 - Topics 1 through 16, beginner, topic 1 on the far left, topic 16 on the far right
10101010 10101010 10101010 10101010 - Topics 1 through 16 master, topic 1 on the far left, topic 16 on the far right
10101010 01010101 00000000 00000000 - Topics 1 through 4 master, topics 5 through 8 advanced, topics 9 through 16 beginner
This encoding allows up to 4^16 = 2^32 (about 4.29 billion) possible states for a subject: 16 topics with up to 4 states per topic.
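As a minimal sketch of this encoding, the helper below packs and unpacks 2-bit topic levels in a Java `int`, with Topic 1 in the two leftmost bits. The class and method names are hypothetical and are not part of the QLearning library.

```java
/** Packs up to 16 topic levels, 2 bits each, into one 32-bit int (Topic 1 leftmost). */
final class StateCodecSketch {
    static final int BEGINNER = 0b00;
    static final int ADVANCED = 0b01;
    static final int MASTER = 0b10;
    static final int TOPICS = 16;

    /** Bit offset of a 1-based topic; Topic 1 occupies the two highest bits. */
    private static int shift(int topic) {
        if (topic < 1 || topic > TOPICS) {
            throw new IllegalArgumentException("topic must be 1.." + TOPICS);
        }
        return 2 * (TOPICS - topic);
    }

    /** Returns a copy of the state with the given topic set to the given level. */
    static int setLevel(int state, int topic, int level) {
        if (level < BEGINNER || level > MASTER) {
            throw new IllegalArgumentException("level must be 0, 1, or 2");
        }
        int s = shift(topic);
        return (state & ~(0b11 << s)) | (level << s);
    }

    /** Reads the 2-bit level stored for the given topic. */
    static int getLevel(int state, int topic) {
        return (state >>> shift(topic)) & 0b11;
    }

    /** Renders the state as four space-separated groups of eight bits. */
    static String toBits(int state) {
        String bits = String.format("%32s", Integer.toBinaryString(state)).replace(' ', '0');
        return bits.substring(0, 8) + " " + bits.substring(8, 16) + " "
                + bits.substring(16, 24) + " " + bits.substring(24);
    }
}
```

Setting Topics 1 through 4 to `MASTER` and Topics 5 through 8 to `ADVANCED` reproduces the third example above. One thing to keep in mind is that a Java `int` is signed, so any state in which Topic 1 is at master level has its top bit set and is a negative number; code that uses the state as an array index or sort key would need to account for that.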
Limiting Content Availability
I did not implement this for my example, but I easily could have and almost certainly will have to in the future. For a course, assume that it has 100 pieces of content. Given 4 billion states and 100 pieces of content, this is quite a bit of data to store in a database, let alone in memory. We can limit the number of pieces of content that are available at a given state by storing with each piece of content whether it is visible when a prerequisite topic has been mastered or is hidden if a topic has already been mastered. Since the student's state is stored using a binary encoding, we could create a binary encoding for this mapping as well.
For example, assume that Student A has mastered Topics 1, 2, and 3 but hasn't mastered Topic 4 or anything later. Now say you have content that teaches Topic 1, Topic 4, and Topic 13. You would only want to present Topic 4 content, since the student has already mastered Topic 1, and has not yet mastered the prerequisites for Topic 13. Ideally, you would only present a choice from a relatively small subset of content, perhaps 5-10 pieces. You may not want to be too aggressive with filtering, since the agent should learn itself over time which content is valuable at each state. Also, the student should still be able to access all content from outside of the agent's recommendation, since the student may want to review the content that he has already mastered or may want to read ahead.
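Since this filtering was not implemented in the POC, the following is only a sketch of how it might work with bitmasks over the 2-bit-per-topic state. All names are hypothetical, and the prerequisite relationships in the usage example (Topic 4 requires Topic 3, Topic 13 requires Topic 12) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative bitmask filter over the 2-bit-per-topic state; names are hypothetical. */
final class ContentFilterSketch {
    /** Mask with the "master" pattern (10) in the slot for a 1-based topic. */
    static int masterMask(int topic) {
        return 0b10 << (2 * (16 - topic));
    }

    /** A piece of content plus the masks that decide when it is shown. */
    static final class Content {
        final String id;
        final int teachesMask;      // master bits of the topic(s) this content teaches
        final int prerequisiteMask; // master bits of topics that must already be mastered

        Content(String id, int teachesMask, int prerequisiteMask) {
            this.id = id;
            this.teachesMask = teachesMask;
            this.prerequisiteMask = prerequisiteMask;
        }
    }

    /** Eligible when all prerequisites are mastered and the taught topics are not all mastered. */
    static boolean isEligible(int state, Content c) {
        boolean prerequisitesMet = (state & c.prerequisiteMask) == c.prerequisiteMask;
        boolean alreadyMastered = (state & c.teachesMask) == c.teachesMask;
        return prerequisitesMet && !alreadyMastered;
    }

    /** Returns the ids of all catalog entries that are eligible at the given state. */
    static List<String> eligibleIds(int state, List<Content> catalog) {
        List<String> ids = new ArrayList<>();
        for (Content c : catalog) {
            if (isEligible(state, c)) {
                ids.add(c.id);
            }
        }
        return ids;
    }
}
```

For a student whose state has Topics 1, 2, and 3 at master level, a catalog holding content for Topics 1, 4, and 13 would be reduced to the Topic 4 content alone: Topic 1 is already mastered, and the assumed prerequisite for Topic 13 is not. Each check is a pair of bitwise comparisons, so filtering a catalog of 100 items per request is cheap.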
Limiting State Transitions
With a 32-bit state representation, the application has the potential to represent about 4.29 billion unique states. In practice, students would likely never be in that many discrete states. First, I currently only allow 3 states for a topic: beginner, advanced, and master. That fourth state is never used, which leaves 3^16, or about 43 million, reachable encodings for a 16-topic course. Second, if the course has 16 topics, a new student would likely never transition directly from no mastery to all mastery after a single piece of content. Most transitions would not be possible since content is focused on specific topics. We have options on how to assess the student, and if the assessments only allow some state transitions, the course will never use all transitions. I believe that we should have small assessments after each piece of content and then larger assessments at expected points within the course. Small assessments may only allow recording the mastery of a specific topic. Larger assessments may allow for mastery of multiple topics. As I am not experienced with the theories behind effective student assessments, I'll leave this as a future exercise.
Agents in Lambdas
My initial implementation created a service that loaded the model as part of the Lex Lambda project. For a small model, that was fine. For a model with millions of states and transitions and many pieces of content, the RL Agent within the lambda will take too long to load, and will consume too much memory when running. The current service is very simple and could be included just about anywhere else if desired. I had at least two thoughts.
- Spring Boot and Docker in ECS, presented as a REST endpoint. I could have the Lex Intent service call an external REST endpoint, and that REST endpoint call the agent that runs as a service inside the Spring Boot application. The advantage here is the ability to scale through Docker containers. As long as the ECS instance running the Docker container is big enough to hold the model, you can automatically scale. On the downside, you now have a full ECS instance running continually, and the cost of having this service running may be a concern. You also need to work up a process to reload the model if it changes. You could integrate the model building process in here directly and remove the last manual step in the feedback loop.
- API Gateway connecting to the model published by AWS SageMaker. This is more of a research project for a couple of reasons. First, the QLearning algorithm implementation I'm using is in Java; I would need to find a Python version. Second, I haven't used the AWS SageMaker publishing capability or built an API gateway to front the service. Like the Spring Boot solution above, the Lex Chatbot would call a REST endpoint, in this case the API Gateway front to the model. This may be a design change and wouldn't have the same flexibility that the Spring Boot + Docker option has, at least without doing further research.
Multiple Pieces of Content between Assessments
In order to fairly reward content, we need to assess every time content is presented. If, for example, a student watches a lecture and then reads the course book and then takes a test, how do we know if the lecture or the textbook taught the student better? Maybe the student could have skipped one of them entirely and gotten the same result. At this point, I do not have an answer to this question. One solution that I've thought about is to chain pieces of content together as a "super content" block and treat that with its own content id. As you might imagine, if we have 100 pieces of content, allowing 2 pieces of content to be a block like this yields 9,900 ordered pairs (100 x 99). We would almost certainly have to limit the number of pieces of content available at a state to a fairly small number (less than 10). Further research would be required.
Ordering of State Changes
I didn't appreciate this one at first, and I suspect that as the amount of data fed into the model increases, it will be less relevant. However, I noticed that from a cold start (i.e. little data in the model), the order in which I entered state transitions had a significant effect on the model. For example, say that a subject has three topics and that the student has no prior knowledge of the subject. The student views content for Topic 1, is assessed, and shows mastery. The student then views content for Topic 2, is assessed, and shows mastery. Finally, the student views content for Topic 3, is assessed, and shows mastery. By mastering all three topics, the student shows mastery of the subject. In our simple reward algorithm, only the last assessment would return a large reward; the two previous assessments would return a negative reward.
When the model is built, the order of the state changes matters. The first state change associates the content with only the small negative reward. The second state change does the same. The third state change associates the content with a large reward. The next student who takes the class will be presented content, but the model will not know that the Topic 1 content contributed to the result, since that transition was fed in before the model knew about the path that led to the winning state. The same problem exists for the second topic. When we get to the third topic, we see that the specific piece of content led directly to the winning state, so that piece of content will likely be returned again.
However, if we feed the state changes in reverse order, with the Topic 2 to Topic 3 winning transition first, the behavior changes. When Topic 1 to Topic 2 is fed in, the model already knows that there is a path from Topic 2 to Topic 3. It sees that the content presented led directly to Topic 2 mastery, and so that content will also be prioritized. When the transition from the initial state to Topic 1 is fed in, the model knows that there is now a path from Topic 1 to Topic 3 and will prioritize the content presented then, too.
As more and more data is fed into the system, the model will eventually learn that there is a path from no mastery to Topic 3 mastery via various pieces of content. By presenting a finished path in reverse order, however, the initial model will be much better in its early predictions.
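The effect of replay order can be reproduced with a few lines of textbook tabular Q-learning. This sketch does not use the project's library; the class name, the learning rate of 1.0, and the discount factor of 0.9 are assumptions chosen to make the difference easy to see.

```java
/** Illustrative demo of how replay order affects a cold-start Q-table. */
final class ReplayOrderSketch {
    static final double ALPHA = 1.0; // assumed learning rate
    static final double GAMMA = 0.9; // assumed discount factor

    /** One recorded transition: start state, content shown, end state, reward. */
    static final class Transition {
        final int start;
        final int content;
        final int end;
        final double reward;

        Transition(int start, int content, int end, double reward) {
            this.start = start;
            this.content = content;
            this.end = end;
            this.reward = reward;
        }
    }

    /** Replays transitions in the given order into an empty Q-table and returns it. */
    static double[][] replay(Transition[] order, int states, int contents) {
        double[][] q = new double[states][contents];
        for (Transition t : order) {
            double best = q[t.end][0];
            for (double v : q[t.end]) {
                best = Math.max(best, v);
            }
            double target = t.reward + GAMMA * best;
            q[t.start][t.content] += ALPHA * (target - q[t.start][t.content]);
        }
        return q;
    }

    public static void main(String[] args) {
        // States 0..3 count the topics mastered; content c teaches topic c + 1.
        Transition t1 = new Transition(0, 0, 1, -1);
        Transition t2 = new Transition(1, 1, 2, -1);
        Transition t3 = new Transition(2, 2, 3, 100);
        double[][] forward = replay(new Transition[] {t1, t2, t3}, 4, 3);
        double[][] reverse = replay(new Transition[] {t3, t2, t1}, 4, 3);
        System.out.println("forward value of Topic 1 content: " + forward[0][0]);
        System.out.println("reverse value of Topic 1 content: " + reverse[0][0]);
    }
}
```

Under these assumptions, a single forward pass leaves the Topic 1 and Topic 2 content at the step reward of -1, because the winning reward has not been seen yet when those transitions are processed. A single reverse pass carries the winning reward back through the chain, so the Topic 2 content is valued at about 89 and the Topic 1 content at about 79. Replaying the forward data several times would eventually reach similar values, which is consistent with the effect fading as more data arrives.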
My current process for gathering data through the RL Lambda and RL Modeler makes no effort to capture this information and process it in order. You'll notice that the CSV that holds the transitions knows nothing about the student taking the content, and that would be required to present an actual student path through the information. Providing a reverse ordering would require a rewrite of this process, most likely with the database that stores the state and content information storing ending state and not writing to the CSV until a winning state is reached. It is certainly technically doable, but it is not currently implemented.
In the end, the POC was a functional demonstration of how RL techniques could be incorporated into a classroom setting, leveraging the AWS Lex Chatbot as the driver for the user experience. A sample workflow would look something like this:
- Student logs into the application and selects the subject that they wish to work on.
- From the subject page, the student can ask the chatbot various questions; eventually, the student asks the chatbot to recommend content.
- The chatbot understands the intent and calls the RL agent.
- The RL agent retrieves the student's current state and returns the content to the application to render.
- The student interacts with the content as they would any other content.
- Once complete, the student takes an assessment on the content they've just reviewed.
- The assessment is scored, and the student's state is updated to reflect their new understanding of the subject.
The existing POC handles all of these steps, although the content and assessment tools were pseudo implementations. In addition, each of these steps has an implementation that may not be scalable or secure, or may not provide a coherent user experience. Since the goal was to identify the tasks associated with implementing RL, I never expected to provide a production quality release. I did identify how to realistically implement RL in the classroom, and I did discover that existing technology and techniques are available now. With a bit of forethought, organization, and a well-defined goal, a small team of developers can quickly add this capability to the classroom.
- AWS Lex Chatbot in the Classroom
- Reinforcement Learning - https://en.wikipedia.org/wiki/Reinforcement_learning
- QLearning Implementation - https://github.com/chen0040/java-reinforcement-learning
- Reinforcement Learning Diagram - KDNuggets (2018). Retrieved from https://www.kdnuggets.com/2018/03/5-things-reinforcement-learning.html