MTG Posting: Reviewing ML Approaches to Magic: The Gathering
I have always enjoyed building Magic: The Gathering decks just as much as, if not more than, actually playing the game. While it’s always fun to thumb through the 10-15 special cards I pulled out to maybe build a deck around in the future, my deckbuilding efforts normally flounder somewhere between flipping through EDHREC, trying to find cards in my collection that I’m pretty sure I have, and inevitably filing a 30-card stack into a deckbox with a label like “Graveyard Stuff WIP”. Eventually we all persevere through these kinds of false starts, but I have always wanted some means to understand what is possible in my current collection at any given time. The endless hours of scanning cards with my phone haven’t gotten me much closer to this solution using existing deckbuilding sites.
Many of these sites are fantastic at what they do, especially EDHREC, Archidekt, and Moxfield. Their only limitation is that they do not attempt to guide, prescribe, or predict the actual deckbuilding process – they distill the outcomes of many players’ individual deckbuilding processes (i.e. uploaded decklists) into a statistical approximation of complementary sets of cards. This analytical approach is very effective at answering “what cards should I include if I want this specific creature as my commander?”, but it does not tell us what our collection is capable of building. The latter involves a layer of introspection about the cards we already own, a consideration of the cards they are similar and complementary to, and a way to mediate the concrete reality of what is available to us against what we want to build thematically and strategically.
Literature Review & Problem Statement
When I first started reading for this project, I was surprised by the volume of academic literature on deckbuilding and drafting in Magic: The Gathering. After I skimmed a couple of papers, it was clear that I was tackling a meaningfully complex problem. They had titles like Magic: The Gathering is Turing Complete and made a series of mathematical claims that I do not totally understand, but they assured me that computers will have an extremely difficult time solving this game.
Luckily, I am not interested in building some kind of rules engine that exhausts all possible outcomes for an infinite number of board states. I am only interested in turning my big box of Magic cards into well-constructed commander decks.
Deckbuilding Ergonomics
What is synergy?
When building a commander deck, most people use their selected commander as a thematic and mechanical anchor for the rest of their selections. With a general theme in mind, players attempt to fulfill some basic meta-requirements (card draw, ramp, removal, etc.) while making selections that conform with some intended style of play.
The term “theme” is doing a lot of heavy lifting for a topic that has received serious consideration from people more serious than myself. Kritz & Gaina attempt to provide some rigor to the term synergy, which is often thrown around to describe how well-constructed decks are greater than the sum of their parts in a Gestaltian sense.
“Synergy is a set made up of two or more game elements that have a measurable value and where the value of the set is different from the sum of the individual outcomes of its elements.”
This is a neat way to smoosh down all of the variables that might make a set S of cards a fun-to-play core strategy that flows well, while removing or adding a single card could very well produce a slog where you feel like you have no plays. This is obviously hyperbole, but it raises the question of how fine-grained we can get with predicting the positive outcomes associated with all possible additions to an existing set of cards.
Kritz & Gaina attempt to quantify synergy as cards that are most likely to result in a high win rate when played together. Since this can’t be generalized to new sets – whose cards are valid deckbuilding choices but have no game data yet – they use the strength of a card as a proxy for its ability to contribute towards a win. Specifically, they define strength as the ratio of how much mana it costs to cast the card relative to its ability to cause damage to an opposing player.
The proposal is then to calculate this ratio for every combination of cards in sets of 60, resulting in “a number of possible synergy sets on the order of 10^115, which is higher than the number of atoms in the universe”. This is presented as an obvious impossibility, implicitly urging other researchers to focus their efforts toward less categorical means of describing cards.
The Problems of a Large Card Pool
With 37,000+ unique Magic: The Gathering cards printed over 30+ years, players have a massive set of possible choices relative to a given commander selection and must employ goal-appropriate heuristics to reduce that search space. These heuristics differ from player to player and deck to deck, but include considering both categorical characteristics of a card (mana cost, color, type, rule text, power, toughness, etc.) and the predicted emergent characteristics of that card relative to the current Chosen Set.
Drafting is a similar activity to deckbuilding, but it takes place in a context where players have a limited card pool (a single set) and a good understanding of the playstyle archetypes available to them. Ward et al describe drafting as “a stochastic descent towards powerful deck configurations where players try to pick cards that are both individually strong and synergize with each other.” They go on to say that “players aren’t constructing a goal, they’re converging on one of a pre-enumerated set of intended configurations that the set was designed to contain.” They are able to make this claim in the context of drafting because the inherent set limitations (~250-300 unique cards per set, all built around intentional archetypes) create a more concrete picture of possible synergies in the potential card pool. This archetypal guidance is quite literal, since Wizards of the Coast includes draft cards in Play Booster Boxes that outline the various themes in the set’s cards, divided along color identity, tribal, and mechanical lines.

A smaller pool of possible choices and intentionally themed content make it tricky to apply approaches based solely on drafting data to commander deckbuilding tasks, where the legal card pool is well over 21,000 unique cards spread across basically every set in the history of the game.
A merfolk tribal deck consisting exclusively of Lost Caverns of Ixalan cards is going to focus primarily on the Explore mechanic because that was a specific theme of that set. A tribal merfolk commander deck with access to basically every set can be geared around a large set of possible archetypes, with unclear levels of complementarity between those intentional clusters. If we were to make commander deckbuilding recommendations based purely on co-occurrence gleaned from drafting data, we would have no way to identify complementarity across sets. This is a major concern for our project if we want to re-implement some or all of the approaches found in academic literature.
The question of what makes a deck synergistic isn’t exactly the focus of most of the papers I reviewed. Instead, they are focused on how to mine evidence of human decision making around drafting and deckbuilding with the assumption that the empirically most common combinations of card selections are the most complementary. These approaches use either detailed draft selection data or a database of decklists scraped from sites where players upload them, overlaying the sets of cards on top of each other and in relation to one another until clear inter-relationships form.
Their focus is on developing data analysis pipelines that produce the best possible approximation of statistical co-occurrence, but they rarely ask why, or whether, those are good pairings.
What is complementarity?
Some closer reading reveals a range of strategies for answering pertinent questions of card complementarity from different angles. I think it is easiest to frame each of these data science approaches as answering a specific question about some set of cards.
- What is the probability that card A and card B appear in the same deck?
This question does not have a paper associated with it because it is the implicit baseline of any analysis of complementarity. If we collect a large number of decklists, it seems to make sense that two cards that show up in decks together frequently would be complementary. Otherwise, why would players play them together so often?
The issue here is that there is a substantial list of cards that are considered “staples” in the commander format. These cards are compatible with most color identities and deck strategies because they offer general utility and are often colorless. A quick glance at the EDHREC top cards page shows several cards included in over 50% of all decklists ingested by the site (up to 86% for Sol Ring). This is an astronomical number that washes out meaningfully synergistic pairs in favor of overall popularity and generic utility.
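To make that washout effect concrete, here is a minimal sketch of the baseline over a handful of made-up decklists (the deck contents are illustrative, not real data):

```python
from collections import Counter
from itertools import combinations

# Toy decklists; contents are illustrative, not real data.
decks = [
    {"Sol Ring", "Counterspell", "Brainstorm"},
    {"Sol Ring", "Counterspell", "Rhystic Study"},
    {"Sol Ring", "Lightning Bolt", "Goblin Guide"},
    {"Sol Ring", "Llanowar Elves", "Craterhoof Behemoth"},
]

# Count every unordered pair of cards that share a deck.
pair_counts = Counter()
for deck in decks:
    for pair in combinations(sorted(deck), 2):
        pair_counts[pair] += 1

def co_occurrence(a, b):
    """P(A, B): fraction of decks in which both cards appear."""
    return pair_counts[tuple(sorted((a, b)))] / len(decks)

# The staple pairs with everything, drowning out thematic pairings:
print(co_occurrence("Sol Ring", "Counterspell"))    # 0.5
print(co_occurrence("Counterspell", "Brainstorm"))  # 0.25
```

Every pair involving the staple scores at least as high as the genuinely thematic Counterspell/Brainstorm pairing, which is exactly the popularity-over-synergy problem described above.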
- If I already have card A in my deck, what is the probability that I will choose card B to add to that deck?
Ward et al. addressed this question in AI Solutions for Drafting in Magic: the Gathering by examining data from 100,000 drafts conducted by real players. For each card selection that a player makes, the data details their current set of already selected cards, the possible choices that are left in the current pack, and the card that the user ultimately selected for that round. This paper conducted a bakeoff between different types of drafting agents that can approximate the decisions that real humans would make while drafting, but we will only examine two in detail.
The first one I want to discuss is what they refer to as a “Naive Bayes agent”, an intentionally shallow approach to complementarity. It improves on the initial baseline by injecting a sense of directionality into pairings of complementary cards. It does so by considering whether a player is likely to draft a given card when it is offered in a pack, relative to the current contents of their deck. In other words, we are asking: if card A is in the set of possible choices and card B is in my deck, how likely is it that I will actually select card A? For every recorded choice, they record a separate pairing from card A to every card currently in the deck. When the drafting agent makes a draft choice, it picks the card in the pack with the highest aggregate probability relative to all of the cards in the current deck. For first-round picks, where there are no cards in the current deck yet, the agent selects whichever available card was most frequently taken first by humans.
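A minimal sketch of that decision rule, assuming hypothetical count tables (`pick_given` and `seen_given` are my own names; Ward et al.’s data format and any smoothing they apply will differ):

```python
from collections import defaultdict

# Hypothetical count tables, populated from draft logs.
# pick_given[(a, b)]: times card `a` was picked while `b` was already in the deck.
# seen_given[(a, b)]: times card `a` was offered while `b` was already in the deck.
pick_given = defaultdict(int)
seen_given = defaultdict(int)

def pick_probability(candidate, deck_card):
    """Estimate P(pick candidate | candidate offered, deck_card already drafted)."""
    seen = seen_given[(candidate, deck_card)]
    return pick_given[(candidate, deck_card)] / seen if seen else 0.0

def choose(pack, deck, first_pick_counts):
    """Take the pack card with the highest aggregate probability vs. the deck."""
    if not deck:
        # First pick: fall back to the card humans most often took first.
        return max(pack, key=lambda c: first_pick_counts.get(c, 0))
    return max(pack, key=lambda c: sum(pick_probability(c, d) for d in deck))
```

The aggregate in `choose` is a plain sum over the current deck, which is what lets one overwhelmingly common deck card dominate the score for every candidate.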
While set-limited drafts do not suffer from the staple card bias problem to the same degree that commander deckbuilding does, a Naive Bayes agent applied to full commander decks would. Since Sol Ring is included in 86% of decks on EDHREC, P(Card A → Sol Ring) would wash out the signal from the other cards in the deck, making basically any “Card A” seem like a decent choice.
The approach needs some way to consider cards with some special statistical (and, in this case, mechanical) place in the card pool in a larger context. A card like Sol Ring might always be included in your decks regardless of color identity, theme, or strategy, but it is not “synergistic” with any particular selected cards as much as it is just globally useful in a purely mechanical way. We have not yet discussed any approaches that treat cards as anything other than binary included/not included encodings.
- How much more often do A and B appear together than their individual popularities alone would predict?
The ultimate goal of the Ward et al paper was to introduce a co-occurrence solution that directly addresses staple bias. They do so by lifting up co-occurrence pairs that are statistically abnormal once you consider how often each card occurs across all decks. If a staple card like Sol Ring appears in 90% of decks, it should co-occur with a card that appears in 10% of decks in about 9% of all decks by pure statistical chance. If the pair actually appears together more often than that, we could say that there is some potential complementarity between the two cards.
They define lift as lift(A,B) = P(A,B) / (P(A) × P(B)) where a lift value greater than 1.0 means the two cards appear together more often than their individual rates would predict. A value of 1.0 means they’re independent. Values below 1.0 mean that the two cards rarely appear together even if they are both popular, providing a very strong negative signal for what they call “synergy plots”.
[Screenshot of synergy plots]
This approach begins to address the staple bias problem, but requires a large corpus of decklists to filter out the noise of rarely included cards that happen to co-occur with the same card each time. On the other hand, its symmetric nature – lift(A,B) = lift(B,A) – divides out the individual popularities of both cards, so a universally-played card doesn’t automatically get high lift with everything.
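The lift calculation itself is simple to sketch over a handful of made-up decklists (contents are illustrative, not real data):

```python
# Toy decklists; contents are illustrative, not real data.
decks = [
    {"Sol Ring", "Counterspell", "Brainstorm"},
    {"Sol Ring", "Counterspell", "Rhystic Study"},
    {"Sol Ring", "Lightning Bolt"},
    {"Sol Ring", "Llanowar Elves"},
]

def p(*cards):
    """Fraction of decks containing all of the given cards."""
    return sum(all(c in deck for c in cards) for deck in decks) / len(decks)

def lift(a, b):
    return p(a, b) / (p(a) * p(b))

# The staple appears in every deck, so its lift with anything is exactly 1.0,
# while the thematic pairing rises above what chance alone would predict:
print(lift("Sol Ring", "Counterspell"))    # 1.0
print(lift("Counterspell", "Brainstorm"))  # 2.0
```

Dividing by both marginal probabilities is what neutralizes the staple: a card in 100% of decks can never exceed chance-level co-occurrence with anything.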
- Given a starting set of cards, would Card A or Card B be more complementary?
The strongest formulation of this problem comes from Bertram et al.’s Contextual Preference Ranking (CPR) model. Their core claim is that drafting is fundamentally a “set addition problem” where the value of a card depends on the cards already chosen. This is much closer to the actual deckbuilding question than raw co-occurrence, conditional probability, or pairwise lift. We are no longer asking whether two cards tend to appear together. We are asking whether a candidate card is the better next addition to a partially constructed deck.
CPR operationalizes that question with a triplet Siamese network. The anchor is the current draft pool, the positive example is the card the human player actually selected, and the negative example is a card they passed over. The model learns an embedding space where preferred additions are closer to the current pool than rejected alternatives, using Euclidean distance and triplet loss. This distinction matters: the learned distance is not a generic similarity score between two cards. It is an estimate of how well a card complements the existing set.
The results are also much stronger than the earlier drafting agents. Evaluated on the DraftSim dataset of 107,949 human drafts from Throne of Eldraine, the SiameseBot with embedding dimension D=256 achieved 83.78% Mean Top-Two Accuracy. The previously discussed approaches were much lower: NNetBot at 48.67%, DraftsimBot at 44.54%, and BayesBot at 43.35%. For the purpose of this project, the important thing is not just that CPR performs better, but that it asks the right kind of question. It treats a card’s value as contextual rather than intrinsic.
This also gives us a useful way to think about card power. A first pick is just the degenerate case where the context is an empty set. CPR’s observation that a card’s distance to the empty set strongly correlates with human first-pick rates suggests that the model can learn something like an unsupervised card power ranking. But once the draft pool is no longer empty, that global ranking should give way to contextual fit. A generically strong card may be the right first pick and the wrong fifteenth pick if the rest of the deck is already moving in a different direction.
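To illustrate the shape of CPR’s objective, here is a toy sketch of the triplet loss and the inference-time ranking, using tiny hand-picked 3-dimensional vectors in place of the learned D=256 embeddings (the pool-as-mean representation is my simplification, not the paper’s architecture):

```python
import numpy as np

def pool_embedding(card_vectors, dim=3):
    """Embed the current draft pool as the mean of its card vectors;
    an empty pool (a first pick) maps to the origin."""
    if not card_vectors:
        return np.zeros(dim)
    return np.mean(card_vectors, axis=0)

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: the picked card should embed closer to the pool
    than the passed-over card, by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def rank_candidates(pool, candidates):
    """At inference time, rank (name, vector) candidates by distance to the pool."""
    anchor = pool_embedding(pool)
    return [name for name, vec in
            sorted(candidates, key=lambda kv: np.linalg.norm(anchor - kv[1]))]
```

During training, gradient descent would pull the picked card’s embedding toward the anchor and push the rejected one away; here the embeddings are fixed so only the objective and the ranking step are shown.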
UrzaGPT approaches the same broad problem from the opposite direction. Rather than building an explicit metric-learning architecture around triplets, it fine-tunes a general language model on human draft logs. Its core contribution is the application of Low-Rank Adaptation (LoRA) to 1.1 million picks from Kamigawa: Neon Dynasty drafts. After 10,000 training steps, a fine-tuned Llama-3-8B reached 66.2% accuracy in predicting human picks.
That is an impressive result, but it also highlights the tradeoff between learned drafting behavior and usable deckbuilding explanation. A fine-tuned language model can absorb a large amount of card text, pack context, and human preference data, but it does not automatically produce a stable complementarity score between a starting set and a candidate card. CPR is narrower, but its geometry is directly useful: given a current deck state, candidate cards can be ranked by distance to that context. For collection-aware commander deckbuilding, that is the more transferable insight. We need a model that can say “this card is a better next addition to this specific pile of cards,” not merely “this card is generally powerful” or “players often put these two cards in the same deck.”
The limitation is scale and domain transfer. CPR was evaluated on drafts from a single set, where the card pool is small and the archetypes are intentionally designed. Commander deckbuilding has no comparable boundary. The legal card pool spans decades of design, many mechanics were never intended to interact, and a 100-card singleton deck can pursue themes that no draft environment would ever support. A CPR-style model points in the right conceptual direction, but we would still need to adapt it to a much larger pool, richer card representations, and a messier definition of what counts as a “good” deckbuilding decision.
- How similar are Card A and Card B?
The approach here is to embed cards as vectors based on their rules text, type line, and other features, then measure how close two cards are in that vector space.
We only use this for suggesting replacement cards because similarity is distinct from complementarity. The latter is the primary concern of most deckbuilding tasks since a deck is composed of a mutually compatible set of mechanics that amount to a synergistic strategy. If you have a commander that places a +1/+1 counter on a creature you control when some condition occurs, but no cards to ever trigger that ability, your deck would not be very effective.
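As a sketch of how a replacement lookup might work, assuming cards have already been embedded (the 3-dimensional vectors are made up; Llanowar Elves and Elvish Mystic are real functional twins, both one-mana elves that tap for green):

```python
import numpy as np

# Made-up embeddings; in practice these would come from encoding each
# card's rules text and type line.
card_vectors = {
    "Llanowar Elves": np.array([0.90, 0.10, 0.00]),
    "Elvish Mystic":  np.array([0.88, 0.12, 0.00]),
    "Counterspell":   np.array([0.00, 0.20, 0.95]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_replacements(card, k=1):
    """Rank every other card by cosine similarity to the given card."""
    target = card_vectors[card]
    scored = [(name, cosine_similarity(target, vec))
              for name, vec in card_vectors.items() if name != card]
    return sorted(scored, key=lambda nv: -nv[1])[:k]

print(nearest_replacements("Llanowar Elves"))  # Elvish Mystic ranks first
```

Note that this ranking says nothing about whether either elf belongs in a given deck; it only says they do similar jobs, which is why similarity suits substitution rather than selection.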
The Tooling is Cool, But Not Enough
While players manually make the ultimate card selections, the intractability of the search space reduction problem for hobbyist players with thousands of physical cards in their collections has led to the emergence of dozens of web applications, subscription services, mobile apps, and desktop applications meant to track your collection across paper and digital formats. Many of these websites and applications also provide tools, data, and user interfaces to explore what other players are including in their decklists.
Most of these sites revolve around making recommendations of complementary cards relative to some starting point. A single selection like a specific commander, a theme, or a tag yields a large list of decklists or individual cards that are related to the initial selection. One of the most popular deckbuilding sites, EDHREC, collects massive numbers of decklists uploaded by players and suggests cards that frequently co-occur with an initial selection. Players then scroll through that initial explosion of choices from a single point and pivot to additional lists from the original. The site’s developers have devised a series of logical checks that ensure that staple cards that are common in basically every commander deck do not wash out the co-occurrence signal for more specifically complementary cards.
This design directly mimics looking through an endless set of individual decklists uploaded by other players, but with the scale of machine web scraping and well-tuned statistical analysis. In fact, EDHREC ingests huge numbers of decklists from sites, like Moxfield and Archidekt, where players would likely go themselves. Instead of browsing a single decklist and determining that two cards are complementary on your own, EDHREC presents you with an aggregated meta-decklist assembled from thousands of decks with the same commander.
To square this with our literature review, EDHREC is essentially answering the question: “How much more often do A and B appear together than their individual popularities alone would predict?”. While I am sure that their implementation deviates from the public example in many ways, you can clearly see the application of lift to reduce the noise from commander staple cards. The sheer size of the decklist database they have built allows the site to partially handle the staple bias problem in co-occurrence systems through sample size alone.
This is a fundamentally sound approach that produces suggestions that clearly resonate with Magic: The Gathering players, but I believe we can do much more to intelligently consider the cards that players already own in their collections and to make complementarity claims based on more than just co-occurrence. Specifically, we can augment pure co-occurrence data with additional data sources and analysis methods to more thoughtfully consider the full range of deckbuilding possibilities from an already chosen set of cards.