Case Study: Querying a Video Dataset Using Explainable A.I.

vialab research · Mar 8, 2022

Explainable Artificial Intelligence (XAI) is an emerging field of artificial intelligence that focuses on the development of AI models that can explain their behavior on some level to human collaborators. These explanations can be developer-focused, to help in understanding, designing, and improving models, or user-focused, wherein the model’s explanations are intended to help the user make informed decisions when using the XAI.

Current XAI interfaces are mostly geared towards artificial intelligence experts and those familiar with the model, visualizing a model’s internal features as well as its output. However, recent advancements in the interpretability, trust, and explainability of XAI systems inspired us to research ways to create XAI models and explanation interfaces that are useful not just to A.I. professionals familiar with the model, but to anyone who is familiar with the data being operated on.

In our research paper “Learn, Generate, Rank, Explain: A Case Study of Visual Explanation by Generative Machine Learning” we explore the feasibility and utility of XAI that explains its reasoning in an intuitive visual manner while performing user-generated queries on a dataset of motion-capture videos.

In pursuit of this goal, we developed a Generative Adversarial Network (GAN) model for querying the database, as well as an interactive dashboard for users to query the data and investigate the A.I. model’s reasoning. Using these tools, we then conducted a study with 44 participants to examine the effects of our XAI model on user behavior versus a more traditional “black-box” AI model and determine whether users benefitted from a deeper understanding of the model’s interpretation of the search query.

The A.I. Models

For our case study, we used two different models: an XAI model powered by a Generative Adversarial Network (GAN), and a more standard “black-box” model built on a convolutional neural network (CNN) architecture to compare our XAI model against. When developing these models, we specifically ensured that they performed almost identically to one another in terms of the outputs they produced.

Our XAI Model

For this project, we used the CMU Motion Capture Database. This dataset contains 2548 videos of different actors performing 1095 unique activities along with the accompanying motion capture data where the actors’ joints are accurately localized in 3D space.

Instead of feeding the video data itself to our XAI model, we use the 3D motion capture data associated with the videos to query the dataset for videos of actors performing specified actions/activities.
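As a rough illustration, a motion-capture clip can be thought of as a sequence of per-frame 3D joint positions. The sketch below assumes a 31-joint skeleton and root-joint centering; the exact preprocessing used by our model may differ.

```python
import numpy as np

# A minimal sketch of how a motion-capture clip might be represented as model
# input. The 31-joint skeleton and root-joint centering are assumptions for
# illustration, not necessarily the preprocessing used in the paper.

NUM_JOINTS = 31  # assumed skeleton size


def clip_to_tensor(joint_positions):
    """Convert per-frame joint positions into a (frames, joints, 3) array.

    joint_positions: iterable of frames, each a list of (x, y, z) tuples.
    """
    clip = np.asarray(joint_positions, dtype=np.float32)
    assert clip.ndim == 3 and clip.shape[1:] == (NUM_JOINTS, 3)
    # Center each frame on the first (root) joint so the representation is
    # invariant to where the actor stands in the capture space.
    clip -= clip[:, :1, :]
    return clip
```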

Our explainable A.I. model takes the user’s query as input and generates several examples of motion that it thinks the query describes, as well as examples it thinks the query doesn’t describe. We call these generated examples “supports”. The model then uses these supports to query the dataset for motion and activities that it thinks match the user’s query.
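To make the support-based querying idea concrete, here is a minimal sketch of how generated supports could be used to rank dataset clips. It assumes clips and supports have been embedded into a shared vector space (the hypothetical `clip_embeddings`, `pos_supports`, and `neg_supports` arrays); it is not the paper’s exact ranking procedure.

```python
import numpy as np

# Sketch: rank dataset clips by similarity to generated positive supports
# minus similarity to generated negative supports. The embedding step is
# assumed to exist elsewhere and is not shown here.

def rank_by_supports(clip_embeddings, pos_supports, neg_supports):
    """Return clip indices sorted from best to worst match.

    clip_embeddings: (num_clips, dim) embeddings of dataset clips.
    pos_supports:    (num_pos, dim) embeddings of generated matching examples.
    neg_supports:    (num_neg, dim) embeddings of generated non-matching examples.
    """
    def cosine(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T  # (num_clips, num_supports)

    score = (cosine(clip_embeddings, pos_supports).mean(axis=1)
             - cosine(clip_embeddings, neg_supports).mean(axis=1))
    return np.argsort(-score)
```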

The Black-Box Model

The discriminative ranking model feeds the query to a recurrent neural network language model and the raw video clip to a 1D residual convolutional neural network. It then computes a matching score between the video clip and the query, which is used to rank the video clip in the search results.
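The sketch below illustrates this kind of discriminative ranker in PyTorch: a GRU encodes the query, a small 1D residual CNN encodes the clip’s frame-level features, and their dot product serves as the matching score. The layer sizes and residual block design are illustrative assumptions, not the exact architecture used in the study.

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """A simple 1D residual block (an assumed design, for illustration)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class DiscriminativeRanker(nn.Module):
    def __init__(self, vocab_size, clip_feature_dim, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.query_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.clip_cnn = nn.Sequential(
            nn.Conv1d(clip_feature_dim, hidden, kernel_size=3, padding=1),
            ResBlock1D(hidden),
            ResBlock1D(hidden),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, query_tokens, clip_features):
        # query_tokens: (batch, words); clip_features: (batch, frames, clip_feature_dim)
        _, h = self.query_rnn(self.embed(query_tokens))            # h: (1, batch, hidden)
        q = h[-1]                                                  # (batch, hidden)
        c = self.clip_cnn(clip_features.transpose(1, 2)).squeeze(-1)  # (batch, hidden)
        return (q * c).sum(dim=-1)  # one matching score per (query, clip) pair
```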

The User Dashboard

Our dashboard interface acts as a mediator between the XAI model and the user; it allows the user to send queries to the explainable A.I., and view the results of that query along with the supports it generates.

The design of the dashboard is geared towards data-domain experts who don’t necessarily have a deep knowledge of artificial intelligence but are familiar with the data on which the system operates.

A screenshot of the User Dashboard with annotations labeling different components of the interface.

In the top left of the dashboard, the user can type a search query and select the querying algorithm (generative or discriminative). The results are shown in a ranked list underneath the search area, along with a confidence bar illustrating where in the video clip the query is likely to occur (red indicating higher confidence). The user can select a video from the results which is then shown in the center of the dashboard. If the explainable generative model is used, then the supports it generated are shown in a list on the right side of the dashboard as evidence that the video clip matches the user’s query. The supports are sorted to show the most important evidence first. Clicking on one of the support clips opens a pop-up video player for the user to review the generated evidence in detail.
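For illustration, the per-clip confidence bar could be produced by scoring a sliding window of the clip against the query and normalizing the resulting scores for display. The window length, stride, and `score_fn` below are assumptions, not the dashboard’s actual implementation.

```python
import numpy as np

# Sketch: derive a per-segment confidence profile for a clip by scoring a
# sliding window with whichever ranking model is in use.

def confidence_profile(motion, score_fn, window=120, stride=30):
    """Return one confidence value per window position in the clip.

    motion:   (frames, features) array for a single clip.
    score_fn: callable mapping a (window, features) segment to a scalar score.
    """
    scores = []
    for start in range(0, max(1, len(motion) - window + 1), stride):
        scores.append(score_fn(motion[start:start + window]))
    scores = np.asarray(scores, dtype=np.float32)
    # Normalize to [0, 1] so the dashboard can map values to a red color ramp.
    if scores.max() > scores.min():
        scores = (scores - scores.min()) / (scores.max() - scores.min())
    return scores
```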

In short, the generative ranking interface allows the user to drill down into deeper evidence for the provided answers whereas with the discriminative ranking interface, the ability to understand the A.I.’s decision ends with the ranked list and confidence bar.

User Case Study

Having created this explainable AI model and explanation interface, we became excited to put its efficacy to the test.

Specifically, we hypothesized that the explanation interface would facilitate the user’s understanding of the XAI system’s behavior while improving the user’s task performance by building a correct mental model of the AI and establishing appropriate trust and reliance on the system.

A successful XAI system should allow users to gain a better understanding of the system’s behavior and build a correct mental model of its operations. In this study, we use a series of prediction tasks and questionnaires to better understand the benefits of using an XAI system over a black-box one for mental model formation.

Study Design

The study consisted of three types of task for the participant to perform with the help of an A.I.:

  1. Clip Identify — the participant is prompted to pick up to 3 video clips from a set of 10 that they think are most relevant to the displayed keyword using a drag-and-drop interface.
  2. Timeline Spot — the participant is asked to identify which segments of a longer video illustrate the displayed keyword.
  3. User-Machine Collaboration Task — the participant is given one of 7 manually-constructed scenario descriptions and must search the dataset for the video clip that best matches the scenario.
A screenshot of a Timeline Spot Task Trial with the XAI system after a participant has completed the trial. This image shows the video containing the specified keyword (top-left), the XAI generated evidence (bottom-left), the prediction/output summary (top-right), and the AI performance-assessment questionnaire (bottom-right). Note that the explanation interface described and shown previously was modified to facilitate each of the 3 task types.

At all times the participant had access to help from only one type of AI: either our generative XAI model or our discriminative “Black Box” AI model.

For the first two task types (Clip Identify and Timeline Spot), when the XAI model is available, the user is: 1) shown a list of model-generated supports that the XAI will base its decision on, then 2) asked to fill out a short questionnaire to assess their level of confidence and trust in the XAI system and the supports it generated, and 3) asked to predict the XAI’s suggestion based on the supports it generated. Only after the participant has filled out the questionnaire and predicted the XAI’s output are they given access to the XAI’s actual solution to the trial task.

For each task, the participant is given the opportunity to accept the AI’s answer as their own, and for the Clip Identify and Timeline Spot tasks, the participant is shown a summary that compares the correct answer, the participant’s answer, the A.I.’s answer, and (in the case of XAI) the participant’s prediction of the AI’s answer.

Finally, at the end of each trial task, the participant completes a questionnaire assessing their perception of the AI system’s performance.

The study tool itself measures the participants’ performance (both time and accuracy), their decisions on whether to request the AI’s help, their answers to the XAI task questionnaires, and some other miscellaneous metrics.
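As a rough sketch, each trial could be logged as a record along the following lines; the field names here are hypothetical and not the study tool’s actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-trial log record, for illustration only.

@dataclass
class TrialRecord:
    participant_id: str
    task_type: str                 # "clip_identify", "timeline_spot", or "collaboration"
    condition: str                 # "xai" or "black_box"
    completion_time_s: float
    user_answer: str
    ai_answer: str
    correct_answer: str
    used_ai_help: bool
    predicted_ai_answer: Optional[str] = None   # XAI trials only
    trust_rating: Optional[int] = None          # from the per-trial questionnaire
```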

Study Results

With a total of 44 participants engaging in more than 350 distinct AI-assisted tasks, the collected data features completion time, user and AI accuracy, and user sentiments towards the AI systems and XAI-provided explanations.

Upon observing a divergence in performance and reaction between participants who explicitly stated low levels of trust in the AI’s explanations and those who did not, we decided to segment the study results into three separate task clusters:

  • Black-Box AI Tasks
  • XAI tasks completed by users with low levels of trust (XAI LOW)
  • XAI tasks completed by users who did not express explicit distrust (XAI HIGH)

Speed — Without the task segmentation, the XAI system presented a negligible advantage over the Black Box AI system. However, when the tasks are clustered into the three categories described above, a significant variation in completion time appears for the Timeline Spot task with XAI LOW boasting the lowest average completion time (37s), followed by XAI HIGH at 46s and Black Box AI tasks at 65s. This result suggests that the use of an explainable AI system helped users decide more quickly whether to trust the AI’s recommendations.

Accuracy — Accuracy is defined here as the proportion of instances where the user, assisted by the AI system, was able to identify the correct answer in a single task trial. Accuracy was highest for the XAI HIGH cluster (74.0%), followed by Black Box AI (68.2%) and XAI LOW (44.4%). These results indicate that when trust is low, the user may assume the generated evidence is unreliable and proceed to submit their own (often incorrect) answer.

User-Machine Synchronization — defined here as the proportion of instances where the user and the AI system select the same answer (regardless of its accuracy), this measure indicates how often the user accepted the AI’s suggestion. In strong alignment with the accuracy results, XAI LOW tasks resulted in a significantly lower synchronization rate of 37.4%, followed by XAI HIGH (58.6%) and Black Box AI (60.0%). These results may indicate that the provision of trustworthy evidence (XAI HIGH) does not help any more than no evidence (Black Box AI), but that the provision of untrustworthy evidence (XAI LOW) can actually drive users away from AI suggestions.

Results of accuracy (left), user-machine synchronization (middle), and user skepticism (right) across the three task clusters. Skepticism here refers to the proportion of trials where the AI was incorrect and the user was still able to find their own correct answer. There was no significant difference in skepticism across the three task clusters.
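Given trial records like the hypothetical `TrialRecord` above, the three outcome measures could be computed as follows; the formulas mirror the definitions in the text, but the code itself is illustrative rather than the study’s actual analysis script.

```python
# Sketch: compute accuracy, user-machine synchronization, and skepticism
# from a non-empty list of TrialRecord objects.

def summarize(trials):
    n = len(trials)
    accuracy = sum(t.user_answer == t.correct_answer for t in trials) / n
    synchronization = sum(t.user_answer == t.ai_answer for t in trials) / n
    # Skepticism: of the trials where the AI was wrong, how often did the
    # user still find the correct answer on their own?
    ai_wrong = [t for t in trials if t.ai_answer != t.correct_answer]
    skepticism = (sum(t.user_answer == t.correct_answer for t in ai_wrong) / len(ai_wrong)
                  if ai_wrong else 0.0)
    return {"accuracy": accuracy,
            "synchronization": synchronization,
            "skepticism": skepticism}
```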

Our findings ultimately did not support our initial hypothesis that our explainable AI interface would facilitate a better understanding of the XAI’s behavior and improve participants’ task performance compared to a Black Box AI system. We discovered that the XAI system yielded a level of efficiency, accuracy, and user-machine synchronization comparable to its black-box counterpart if the user exhibited a high level of trust in the AI’s explanations. However, if the user exhibited low trust in the XAI’s explanations, then accuracy and user-machine synchronization were far lower.

Furthermore, there was no significant indication that users were able to correctly accept or reject the XAI system’s assistance, and the results were largely comparable with the Black Box AI counterpart. These results suggest that users form trust in and affinity for the XAI or AI system largely based on instinct, and that the system may produce video clips that ultimately lead to correct answers but don’t necessarily seem logical or comprehensible to human users.

Although the results of our study did not support our hypothesis, we did uncover some findings that suggest the potential for future XAI systems to increase user performance in the way we had hoped ours would. Furthermore, some of our other findings may be useful in informing further research and development on XAI systems. Here are some of those findings:

  1. The amount of trust in the XAI’s evidence (as indicated in the mental model questionnaire) correlated highly with user-machine synchronization (the XAI LOW cluster showed lower rates of user-machine synchronization than the XAI HIGH cluster). This, combined with the fact that individual users’ reactions to the XAI explanations tended to shift from one trial to the next, suggests that users do actively respond to XAI explanations and change their opinions and behavior accordingly.
  2. While the Black Box AI system originally seemed to yield higher overall satisfaction amongst the participants, there was a sharp divide in satisfaction between participants with a high level of trust in and reliance on the XAI system and those without. Upon further analysis, it was evident that the XAI system resulted in a more positive experience overall compared to the Black Box AI system, provided the users had a high level of trust in and reliance on the system. Additionally, users who trusted the XAI explanations also performed better than those using the Black Box AI in some tasks.
  3. In addition to collecting participant reactions to model-generated video clips as part of the study, five external data annotators (with no prior experience with the study) were hired to conduct a post-hoc analysis of the quality of the XAI explanations and were asked to rate how well the model-generated clips represented each query. This data was used as a proxy for the quality of the XAI explanations as well as an indicator of participant attentiveness throughout the study. There was no apparent correlation between participant reactions to individual clips and externally annotated ratings. This may indicate that clip quality assessment criteria differed between our experts and participants, or that overall clip quality did not strongly influence users’ trust or confidence in the AI agent. It was notable, however, that positive user reactions were clustered around clips that feature exaggerated motions and cartoon-like premises, such as “bear (human subject),” “salsa dance,” and “express joy,” while more generic and muted clips such as “pull up” and “walk and turn repeated” received negative reactions.
  4. There was divergence between clusters of participants who found the XAI system to be reliable and influential to their decision-making processes, and those who deemed the system to be counter-intuitive and underwhelming. One participant wrote “[AI explanation] is a good basis of determining the reliability of AI in terms of [whether] the AI is able to detect the proper animations,” and another expressed satisfaction, stating “I’m impressed with what the AI system outputs.” However, some expressed caution and distrust, with one writing “the AI system often interpreted small portions of movements as if they met the definition of the keyword although it was a mere segment of the movement,” and another writing “I didn’t trust it completely as it detected similar movements and categorized it as the real one.” Two participants plainly wrote “I did not see the explanation,” alluding to the possibility that what constitutes an AI explanation may vary between individuals or may require additional training or clearer messaging to help people interpret generated clips as explanations.

In Conclusion…

Although our user study did not validate our hypothesis that the explainable AI interface would aid users in forming a correct mental model of the AI’s behavior and result in better task performance, we still believe that the XAI model and interface we created are an interesting example of XAI that functions, and is presented, in a way that average users can understand and use to make informed decisions. Furthermore, we believe that our work here is an early step in examining, measuring, and dissolving the tension and distrust between human users and AI systems, and we hope that it will help pave the way to future research opportunities pertaining to human-in-the-loop AI systems.


Visualization and HCI research at Ontario Tech University, led by Dr. Christopher Collins http://vialab.ca