
Benchmarking Case Outcome Prediction for the UK Employment Tribunal: The CLC-UKET Dataset

Author(s)

Holli Sargeant
PhD Candidate, University of Cambridge
Felix Steffek
Professor of Law and J M Keynes Fellow at the University of Cambridge

Employment tribunals play a critical role in resolving disputes between employers and employees, yet the volume and complexity of cases create challenges for timely and consistent resolution. Predicting case outcomes through advanced AI can enhance access to justice, streamline legal processes and help stakeholders make better-informed decisions. In a recent paper published by the Association for Computational Linguistics in the Proceedings of the Natural Legal Language Processing Workshop 2024, Huiyuan Xie, Felix Steffek, Joana Ribeiro de Faria, Christine Carter and Jonathan Rutherford explore the intersection of technological innovation and access to justice, focusing on the development of benchmarks for predicting case outcomes within the UK Employment Tribunal (UKET).

Despite the potential benefits of predictive models in legal contexts, there remains a notable gap in available legal data that hampers AI advancements. Publicly accessible, comprehensive datasets are rare, particularly those that offer standardised annotations of legal decisions. Addressing this gap, the CLC-UKET dataset, created as part of this project, provides an extensive, curated collection of UKET cases, annotated and organised to enhance predictability and transparency in employment dispute resolution.

The CLC-UKET dataset was curated from the Cambridge Law Corpus, compiling approximately 19,000 UKET cases. The dataset includes intricate legal annotations across multiple facets, making it a comprehensive resource for legal AI applications. Manual annotation by legal experts is a time-consuming and costly process. To alleviate this burden, we explored the use of large language models (LLMs) to automate the annotation process. By utilising LLMs, specifically the GPT-4-turbo model, we efficiently handled vast quantities of data without compromising on the accuracy or depth of information. Through an iterative approach to prompt design, we optimised the LLM’s performance in annotating the following details: (1) facts, (2) claims, (3) references to legal statutes, (4) references to precedents, (5) general case outcomes, (6) general case outcomes labelled as ‘claimant wins’, ‘claimant loses’, ‘claimant partly wins’ and ‘other’, (7) detailed orders and remedies and (8) reasons. We report on this process in more detail in another paper available on SSRN and arXiv.
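To give a flavour of what such an automated annotation call can look like, here is a minimal sketch using the OpenAI chat API. The prompt wording, JSON keys and helper function are illustrative assumptions made for exposition; they are not the prompts used in the project.

# Illustrative sketch only: prompt text, JSON keys and model settings are
# assumptions for exposition, not the project's actual annotation pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ANNOTATION_PROMPT = """You are annotating a UK Employment Tribunal judgment.
Return a JSON object with the keys: facts, claims, statute_references,
precedent_references, general_outcome, outcome_label, orders_and_remedies,
reasons. The outcome_label must be one of: "claimant wins", "claimant loses",
"claimant partly wins", "other".

Judgment text:
{judgment}
"""

def annotate_case(judgment_text: str) -> str:
    """Ask GPT-4-turbo to produce structured annotations for one case."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[{"role": "user",
                   "content": ANNOTATION_PROMPT.format(judgment=judgment_text)}],
    )
    return response.choices[0].message.content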

The annotated CLC-UKET dataset allows for case outcome prediction, a challenging but valuable task in legal AI. Acknowledging the discussion on task terminology, we use the term ‘prediction’ rather than ‘classification’ because the input contains only facts and claims, without any explicit outcome information. Given a set of case facts and claims, the model generates an outcome label that falls into one of four categories: ‘claimant wins’, ‘claimant loses’, ‘claimant partly wins’ or ‘other’. Intentionally excluding any explicit details about the tribunal’s final decision tests the model’s predictive capabilities based on the input case summaries alone. To establish a baseline for model performance, human predictions were collected by providing experts with access to the same facts and claims, without the actual case outcomes. Comparing human predictions to model outputs is crucial for understanding the limitations and strengths of AI in this domain.
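To make the set-up concrete, the following sketch shows what a single prediction instance conceptually looks like. The field names and example text are hypothetical and are not drawn from the dataset.

# Hypothetical prediction-task instance: field names and text are illustrative,
# not the dataset's actual schema or content.
OUTCOME_LABELS = ["claimant wins", "claimant loses", "claimant partly wins", "other"]

example_instance = {
    "facts": "The claimant was dismissed shortly after raising a grievance "
             "about unpaid overtime.",
    "claims": "Unfair dismissal; unlawful deduction from wages.",
    # The model (or human expert) sees only 'facts' and 'claims' and must
    # predict one of OUTCOME_LABELS; the tribunal's actual decision is withheld.
    "target": "claimant partly wins",
}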

Four types of approaches were used to benchmark the dataset’s predictive potential. Their comparative performance sheds light on the effectiveness of model customisation for complex legal tasks.

1. Performance of Finetuned Transformer Models

  • Highest F-scores Overall: Among all models, finetuned transformer models, particularly T5, achieved the best results, showing superior accuracy in predicting outcomes. The T5 model displayed the highest F-scores across most categories, highlighting the advantage of training models specifically on the CLC-UKET dataset.
  • Precision and Recall Strengths: The T5 model achieved strong precision and recall scores for the ‘claimant wins’ and ‘claimant loses’ categories. For instance, T5 attained an F-score of 0.650 for ‘claimant wins’ and 0.734 for ‘claimant loses’. This accuracy underscores how fine-tuning a model on specific legal annotations can enhance precision in interpreting complex tribunal judgments (a minimal fine-tuning sketch follows this list).
  • Gaps in Specific Categories: Despite its overall performance, the T5 model struggled with the categories ‘claimant partly wins’ and ‘other’, where it achieved low F-scores. The ‘other’ category in particular yielded an F-score of zero, suggesting that even advanced models face challenges with underrepresented or very complex outcomes. This outcome indicates that finer distinctions in nuanced cases may require additional tailored training or refined annotation strategies.
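As a rough illustration of how such fine-tuning can be set up, the sketch below casts outcome prediction as a text-to-text task for T5 using the Hugging Face transformers library. The checkpoint, field names and hyperparameters are assumptions, not the configuration reported in the paper.

# Minimal fine-tuning sketch: checkpoint, field names and hyperparameters are
# assumptions, not the paper's reported configuration.
from transformers import (T5ForConditionalGeneration, T5TokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def encode(example):
    # Outcome prediction as text generation: the input is the case summary,
    # the target sequence is the outcome label itself.
    enc = tokenizer("predict outcome: " + example["facts"] + " " + example["claims"],
                    truncation=True, max_length=512)
    enc["labels"] = tokenizer(example["target"], truncation=True, max_length=8)["input_ids"]
    return enc

# train_dataset = ...  # encoded CLC-UKET training split (not shown here)
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir="t5-uket", num_train_epochs=3),
#     train_dataset=train_dataset,
# )
# trainer.train()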

2. Comparative Analysis of GPT-3.5 and GPT-4 Models

  • Small but Notable Improvements with GPT-4: Between the two GPT-based models, GPT-4 consistently outperformed GPT-3.5, although the margin was relatively small. This improvement highlights the incremental advancements in newer LLM versions and how refined language models contribute to higher accuracy in complex legal tasks.
  • Impact of Few-shot Examples on GPT-3.5’s Accuracy: Interestingly, incorporating task-specific few-shot examples significantly enhanced GPT-3.5’s performance. For instance, using few-shot examples that matched the legal area of the target case improved its F-score in outcome prediction more effectively than randomly sampled examples. This result emphasises the importance of contextual relevance when leveraging few-shot learning, especially in specialised fields like legal AI where case-specific nuances matter.
  • GPT-4 Zero-shot Precision: Notably, GPT-4 achieved the highest precision in its zero-shot setting among all baseline models, indicating that it can accurately predict outcomes without task-specific fine-tuning when given the right context. Providing task-related examples in few-shot settings (specifically the ‘juris-2’ setting, where two examples from similar legal areas were provided) boosted GPT-4’s F-score; a sketch of such a prompt follows this list. However, the relatively modest gains suggest that simply adding more examples does not drastically improve performance, pointing to a need for high-quality, highly relevant few-shot examples.
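The sketch below shows one way a ‘juris-2’-style prompt could be assembled: two solved examples from the target case’s legal area, followed by the target case itself. The function and field names are illustrative assumptions, not the paper’s implementation.

# Illustrative 'juris-2'-style few-shot prompt: two solved examples from the
# same legal area as the target case, then the target case itself.
def build_juris2_prompt(target, example_pool):
    """Assemble a few-shot prompt from two same-area examples plus the target."""
    shots = [ex for ex in example_pool
             if ex["legal_area"] == target["legal_area"]][:2]
    parts = ["Predict the outcome: claimant wins, claimant loses, "
             "claimant partly wins, or other.\n"]
    for ex in shots:
        parts.append(f"Facts and claims: {ex['facts']} {ex['claims']}\n"
                     f"Outcome: {ex['target']}\n")
    parts.append(f"Facts and claims: {target['facts']} {target['claims']}\nOutcome:")
    return "\n".join(parts)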

3. Benchmarking Against Human Expert Predictions

  • Human Predictions Outperform AI: A critical reference point for the models’ efficacy was human expert prediction, which achieved an F-score approximately 19% higher than that of the best-performing model, T5. This gap highlights the value of human expertise in interpreting legal nuances that current AI models struggle to replicate.
  • Strength in Judgment-based Decisions: Human expert annotators demonstrated the highest F-scores for both the ‘claimant wins’ and ‘claimant loses’ categories, indicating that the subjective analysis of case nuances may require human interpretation that AI has yet to achieve. On the other hand, GPT-4 outperformed the human experts when predicting ‘claimant partly wins’ and ‘other’, i.e., in more complex cases. (The sketch after this list illustrates how per-category F-scores can be computed.)
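For readers unfamiliar with the metric, the sketch below shows how per-category and macro-averaged F-scores can be computed from gold outcomes and predictions, whether the predictions come from a model or a human expert. The label sequences are toy data, not results from the paper.

# Toy illustration of per-category and macro F-scores; the label lists are
# made-up examples, not the paper's data.
from sklearn.metrics import f1_score

LABELS = ["claimant wins", "claimant loses", "claimant partly wins", "other"]

gold = ["claimant wins", "claimant loses", "other", "claimant wins"]
pred = ["claimant wins", "claimant loses", "claimant loses", "claimant partly wins"]

per_class = f1_score(gold, pred, labels=LABELS, average=None, zero_division=0)
macro = f1_score(gold, pred, labels=LABELS, average="macro", zero_division=0)
print(dict(zip(LABELS, per_class)), macro)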

4. Benchmarking Hard Cases

  • Predicting Hard Cases: The human experts were asked to identify the cases that they considered hard to predict. This allowed the models’ performance on hard cases to be compared with that of the human experts. As expected, both the AI models and the human experts achieved worse scores for hard cases.
  • Finetuned Transformer Models are Best in Predicting Hard Cases: Interestingly, the finetuned transformer models, in particular T5, outperformed both the GPT-based models and the human experts in predicting hard cases.

Whilst the study provides valuable insights into the prediction of dispute outcomes for the UK Employment Tribunal, it is important to acknowledge certain limitations of our findings. First, information leakage, one example of bias in legal data, may arise from using LLM summaries of judge-written judgments, as neutral descriptions of the facts and claims are not available. These summaries may reflect the judges’ post-hoc knowledge and the subjective perspectives that shape their written judgments, and some of this information may leak into the model’s input. Second, while GPT-4 was used for efficient annotation, automated extraction may contain minor inaccuracies, and more detailed factual data could improve predictions. Finally, the dataset spans 2011-2023, during which legal rules and principles evolved, possibly affecting model accuracy over time, as decision dates were only indirectly inferred. Future research will address these aspects to build more robust prediction models.

The CLC-UKET dataset establishes a meaningful benchmark in legal AI, offering a robust resource for advancing outcome prediction in employment tribunals. Access to the CLC-UKET dataset is available through the Cambridge Law Corpus. While AI models demonstrate promising accuracy, particularly with fine-tuning, human expertise still outshines AI in relevant areas. As we move forward, exploring ways to bridge this gap and improve AI’s adaptability will be key to realising a future where predictive AI and human judgment work seamlessly to enhance access to justice.

This project received funding support from the Cambridge Centre for Data-Driven Discovery and Accelerate Programme for Scientific Discovery, made possible by a donation from Schmidt Futures.

Holli Sargeant is a PhD Candidate at the University of Cambridge.

Felix Steffek is Professor of Law at the University of Cambridge.
