The mAP is one of the most popular yet most complex metrics to understand. It stands for mean average precision, and is widely used to summarize the performance of an object detector. If you've ever played with a detection model, you have probably seen this table before:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.309
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.519
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.327
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.173
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.462
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.547
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.297
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.456
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.511
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.376
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.686
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.686
This is COCO's version of mAP (or AP, as they call it). But what does it even mean? What's up with all those symbols? In this post we'll walk you through all the theory necessary to help you not only interpret the information in the table, but also understand the need for such a complex metric.
What is a "Good" Object Detector?
When we measure the quality of an object detector, we mainly want to evaluate two criteria:
The model predicted the correct class for the object.
The predicted bounding box is close enough to the ground truth.
As you can see, when combining both criteria, things start to get murky. Compared to other machine learning tasks, like classification, there's no clear-cut definition of a "correct prediction". This is especially true for the bounding box criterion.
Take, for example, the images below. Which ones would you say are correct and which ones incorrect?
Don't worry if you're struggling to make a decision, it is a tough problem! And it gets even more cumbersome. We might also want to evaluate the detector against objects of different sizes. A detector may struggle with small objects but excel with big ones. Another detector may not be so "good" overall, yet perform better on small objects. And what about the number of objects a model is able to detect simultaneously? Not so simple, as you can see. That's precisely what the COCO evaluation metrics aim to capture.
COCO evaluation metrics provide a standard to measure the performance of an object detector under different, well-established scenarios.
The COCO evaluation extends beyond the scope of object detection, and provides metrics for segmentation, keypoint detection, etc... but that's a topic for another read.
IoU: Intersection over Union
The journey to understand mAP starts with the IoU. The Intersection over Union is a measurement of how well two bounding boxes align together. It is typically used to measure the quality of a predicted box against the ground truth.
The IoU, as its name suggests, is defined as the intersection of the two boxes, divided by the union of both:
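IoU = area(B_pred ∩ B_gt) / area(B_pred ∪ B_gt)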
Graphically, you may understand the intersection and the union as:
Let's analyze the equation for a moment. A few points are worth mentioning:
The union will always be greater than (or equal to) the intersection.
If the boxes don't overlap, the intersection will be zero, making IoU=0.
If the boxes overlap perfectly, the intersection will match the union, making the IoU=1.
The IoU will always be a value between 0 and 1, inclusively.
The bigger the IoU, the better!
You may sometimes hear the IoU being referred to using a fancier name: Jaccard Index or Jaccard Similarity. In the context of object detection, they are the same. The Jaccard Index is a more general mathematical form of comparing the similarity between two finite sets.
In Python, the IoU could be computed as:
def iou(bbox_a, bbox_b):
    ax1, ay1, ax2, ay2 = bbox_a
    bx1, by1, bx2, by2 = bbox_b

    # Compute the coordinates of the intersection
    ix1 = max(ax1, bx1)
    iy1 = max(ay1, by1)
    ix2 = min(ax2, bx2)
    iy2 = min(ay2, by2)

    # Compute the area of the intersection rectangle
    # (zero if the boxes don't overlap)
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    # Compute the area of both bounding boxes
    box_a_area = (ax2 - ax1) * (ay2 - ay1)
    box_b_area = (bx2 - bx1) * (by2 - by1)

    # Finally compute the union of the areas
    union = box_a_area + box_b_area - intersection

    return intersection / union
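As a quick sanity check of the function above, with two made-up boxes in (x1, y1, x2, y2) format:

box_a = (0, 0, 10, 10)
box_b = (5, 5, 15, 15)

# Intersection is 5*5=25, union is 100+100-25=175
print(iou(box_a, box_b))  # ~0.143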
So now we have a number that describes how good a predicted bounding box is, compared to a ground truth. But how bad can the IoU get before we discard the prediction?
IoU as a Detection Threshold
Take another look at the cat prediction examples. It is clear that the one at the far right is off, while the one on the left is acceptable. How did our brains decide this so quickly? And how about the one in the middle? If we are comparing models' performance, we cannot leave this decision to subjective judgment.
The IoU can serve as a threshold to accept or reject a prediction.
You'll see this threshold specified as IoU@0.5, which simply means: "only the bounding boxes with an IoU greater than or equal to 0.5 (or 50%) with respect to the ground truth were taken as correct".
In the literature it is typical to encounter the following IoU thresholds:
IoU@0.5
IoU@0.75
IoU@0.95
IoU@[0.5:0.05:0.95]
The first three should be clear by now. The last one, although confusing at first, is easy to understand. It refers to multiple thresholds and simply means: "all the IoU thresholds from 0.5 to 0.95, using a step of 0.05". If you expand it, you'll see that this notation accounts for 10 different IoU thresholds:
IoU@[0.5:0.05:0.95] = IoU@0.5, IoU@0.55, IoU@0.6, ..., IoU@0.85, IoU@0.9, IoU@0.95
This expression has become very popular thanks to COCO. Sometimes authors abuse the notation and skip the step, simply writing IoU@[0.5:0.95]. While confusing, it is almost certain that they are referring to the ten IoU steps described above. We'll see how to apply multiple thresholds later on.
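As a quick illustration, you can generate these ten thresholds with a Python one-liner (just a convenience snippet, not part of any official tooling):

# The ten IoU thresholds implied by IoU@[0.5:0.05:0.95]
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]
print(thresholds)  # [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]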
Speaking of shorthands:
When evaluating an object detector, if the author doesn't specify an IoU threshold, a threshold of IoU=0.5 is almost always implied.
Finally, you might be thinking that 0.5 is way too low of a threshold to be chosen as the default. Well, that's how hard of a problem object detection is!
True / False Positives and True / False Negatives
The next concepts to understand are TP (True Positive), FP (False Positive), TN (True Negative) and FN (False Negative). These terms are borrowed from binary classification. The following table, for a hypothetical apple classifier, summarizes them:
A good classifier has many True Positive and True Negative predictions, while minimizing the False Positives and False Negatives.
Multi-class Classifiers
The concept presented above doesn't quite fit if we have a multi-class classifier. The binary classifier answers "is this object of this class?", while the multi-class classifier answers "to which of these classes does this object belong?". Fortunately, we can extrapolate it using a One-vs-All or One-vs-Rest approach. The idea is simple: we evaluate each class individually, and treat it as a binary classifier. Then, the following holds:
Positive: the class in question was correctly predicted for the object
Negative: any other class was predicted for the object
Here's the same summary table viewed from the 🍎 class perspective.
It is uncommon to use the True Negatives qualifier when talking about multi-class models, but it would be something like: "the samples that were correctly identified as being from other classes".
Now let's do a similar exercise from the banana perspective.
Did you notice what happened? The (🍌, 🍎) combination is a FP from the apple-class perspective, but a FN from the banana-class perspective. A similar situation happens with the (🍎, 🍌) combination: it is a FN from the apple-class perspective but a FP from the banana-class perspective. This overlap is expected and typical of a multi-class model!
In Object Detectors
The remaining question is: how do we apply the concepts above to an object detector? Detectors are typically multi-class classifiers, but they also have an object localization factor to them. Fortunately, we already have the IoU metric at our disposal!
At this point, we can jump straight to the summary. Let's assume an IoU threshold of 50%:
True Positive
The predicted bounding box has an IoU above 50% with respect to the ground truth, and
Predicted class matches the ground truth.
False Positive
The predicted bounding box has an IoU below 50% with respect to the ground truth, or
There is no associated ground truth, or
The predicted class does not match the ground truth.
False Negative
Every ground truth that doesn't have a matching prediction.
Again, True Negatives are not an interesting case in this scenario.
Ambiguous Examples
In the following image there are two ground truths, but one single prediction. In this case, the prediction can only be attributed to one of the ground truths (the IoU is above 50%). The other ground truth, unfortunately, becomes a False Negative.
In this next example, two predictions were made for the same ground truth. However both of them have a poor IoU below 50%. In this case the two predictions are counted as False Positives and the orphan ground truth is a False Negative.
In this last example, we have a similar scenario, except both predictions have a good IoU (above 50%). In this case, the one with the best IoU is considered a True Positive, while the other one becomes a False Positive.
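To make these rules concrete, here is a minimal, illustrative sketch of how predictions could be matched to ground truths for a single image and class, reusing the iou function defined earlier. The function name and input format are our own; this is a simplification, not COCO's exact matching code (we process predictions in descending confidence order, which is also what COCO's evaluator does):

def match_detections(predictions, ground_truths, iou_threshold=0.5):
    # predictions: list of (bbox, confidence) tuples, bbox as (x1, y1, x2, y2)
    # ground_truths: list of bboxes
    # Returns the number of TP, FP and FN for this image/class.

    # Process the most confident predictions first
    predictions = sorted(predictions, key=lambda p: p[1], reverse=True)
    matched_gt = set()
    tp, fp = 0, 0

    for bbox, _conf in predictions:
        # Find the best still-unmatched ground truth for this prediction
        best_iou, best_gt = 0.0, None
        for i, gt in enumerate(ground_truths):
            if i in matched_gt:
                continue
            overlap = iou(bbox, gt)
            if overlap > best_iou:
                best_iou, best_gt = overlap, i

        if best_gt is not None and best_iou >= iou_threshold:
            tp += 1                  # good match: True Positive
            matched_gt.add(best_gt)  # a ground truth can only be matched once
        else:
            fp += 1                  # poor or no match: False Positive

    fn = len(ground_truths) - len(matched_gt)  # orphan ground truths
    return tp, fp, fn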
Precision, Recall and F1 Score
Now that we know how to apply TP, FP, TN and FN to object detectors, we can borrow other metrics from the classifiers. These are the Precision, Recall and F1 Score. Again, these metrics are measured by class. The following table summarizes them:
Precision
For a given class, the precision tells us what percentage of the class predictions were actually from that class. The following image shows the result of a detector, where there were 3 True Positives and 1 False Positive, resulting in a precision of:
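Precision = TP / (TP + FP) = 3 / (3 + 1) = 0.75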
Keep an eye on precision if you care about your model not giving False Positives. For example: it's preferable to miss an increase in the stock market than to mistake a decrease for an increase.
The astute reader might have noticed that the precision formula does not take False Negatives into account. Paying attention to precision only can be very misleading, as the following example shows:
As you can see, the precision metric resulted in a 100%, but the model is performing poorly, because it has lots of false negatives.
Do not measure precision by itself, as False Negatives are not taken into account.
Recall
For a given class, the recall (or sensitivity) tells us what percentage of the actual class instances were correctly predicted. In this other image, the detector correctly predicted 3 class instances (True Positives), but 3 instances were not predicted at all (False Negatives):
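Recall = TP / (TP + FN) = 3 / (3 + 3) = 0.50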
Keep an eye on recall if you want to avoid False Negatives. For example: it's preferable to erroneously diagnose cancer in a healthy patient than to erroneously tell a sick patient they are healthy.
Again, it can be noticed that the recall does not take the False Positives into account, so it can be misleading if measured in isolation:
As you can see, the recall metric resulted in a 100%, but the model is performing poorly, because it has lots of false positives.
Do not measure recall by itself, as False Positives are not taken into account.
F1 Score
The F-Score or F1 Score is a metric that combines both precision and recall, giving us a nice balance between the two. For the last time, using the same image as above, the model would return an F1 Score of:
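F1 = 2 · (Precision · Recall) / (Precision + Recall)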
The F-Score measures a balance between precision and recall.
PR (Precision-Recall) Curve
So far, we've seen that precision and recall characterize different aspects of a model. In some scenarios it is more convenient to have a higher precision, and in others it is more convenient to have a higher recall. You can tweak these metrics by tuning the detector's object confidence threshold.
Turns out there is a very convenient way to visualize the response of the model to a specific class at different classification thresholds. This is the precision-recall curve, and is shown in the following figure:
The process of creating a PR curve is:
Start by setting your confidence level to 1 and the initial precision to 1. The recall will be 0 (if you do the math). Mark this point in the curve.
Start decreasing the confidence until you get the first detection, then compute the precision and recall. Assuming that first, most confident detection is a True Positive, the precision will be 1 again (there are no False Positives yet). Mark this point in the curve.
Continue to decrease the threshold until a new detection occurs. Mark this point in the curve.
Repeat until the threshold is low enough for the recall to be 1. At that point the precision will probably be around 0.5.
The following image shows the same plot with the threshold evaluated at some points.
It can be seen that the shape of this curve can be used to describe the performance of the model. The following figure shows a "baseline" classifier and a "perfect" classifier. The closer the classifier to the "perfect" curve, the better.
Practical PR Curve Algorithm
In reality, there is a more practical way to compute the PR curve. While it may seem counterintuitive, the result is the same as the steps above, except it is more script-friendly:
Order all predictions from all images from the highest confidence to the lowest.
For each prediction, determine whether it is a TP or an FP, and keep an accumulated count of TPs and FPs (the sum over all previous predictions).
For each prediction, compute the precision and recall using the accumulated TP and FP.
Make a scatter plot of the resulting precision and recall.
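Here is a minimal sketch of that procedure in Python; the function name pr_curve and its input format (a flat list of (confidence, is_true_positive) pairs plus the total number of ground-truth objects) are assumptions made for illustration:

def pr_curve(detections, num_ground_truths):
    # detections: list of (confidence, is_true_positive) pairs,
    #             one per prediction across all images.
    # num_ground_truths: total number of ground-truth objects (TP + FN).
    # Returns two lists, recalls and precisions, one point per prediction.

    # 1. Order predictions from the highest confidence to the lowest
    detections = sorted(detections, key=lambda d: d[0], reverse=True)

    precisions, recalls = [], []
    acc_tp, acc_fp = 0, 0

    # 2-3. Accumulate TP/FP and compute precision/recall at each step
    for _confidence, is_tp in detections:
        if is_tp:
            acc_tp += 1
        else:
            acc_fp += 1
        precisions.append(acc_tp / (acc_tp + acc_fp))
        recalls.append(acc_tp / num_ground_truths)

    # 4. The (recall, precision) pairs are the points of the scatter plot
    return recalls, precisions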
The following table shows the result of running the algorithm on a small dataset of 15 objects:
Running table to compute the PR Curve. Taken from here.
| Prediction ID | Prediction Confidence | TP | FP | Acc TP | Acc FP | Precision | Recall |
|---|---|---|---|---|---|---|---|
| R | 95% | 1 | 0 | 1 | 0 | 1.000 | 0.067 |
| Y | 95% | 0 | 1 | 1 | 1 | 0.500 | 0.067 |
| J | 91% | 1 | 0 | 2 | 1 | 0.667 | 0.133 |
| A | 88% | 0 | 1 | 2 | 2 | 0.500 | 0.133 |
| U | 84% | 0 | 1 | 2 | 3 | 0.400 | 0.133 |
| C | 80% | 0 | 1 | 2 | 4 | 0.333 | 0.133 |
| M | 78% | 0 | 1 | 2 | 5 | 0.286 | 0.133 |
| F | 74% | 0 | 1 | 2 | 6 | 0.250 | 0.133 |
| D | 71% | 0 | 1 | 2 | 7 | 0.222 | 0.133 |
| B | 70% | 1 | 0 | 3 | 7 | 0.300 | 0.200 |
| H | 67% | 0 | 1 | 3 | 8 | 0.273 | 0.200 |
| P | 62% | 1 | 0 | 4 | 8 | 0.333 | 0.267 |
| E | 54% | 1 | 0 | 5 | 8 | 0.385 | 0.333 |
| X | 48% | 1 | 0 | 6 | 8 | 0.429 | 0.400 |
| N | 45% | 0 | 1 | 6 | 9 | 0.400 | 0.400 |
| T | 45% | 0 | 1 | 6 | 10 | 0.375 | 0.400 |
| K | 44% | 0 | 1 | 6 | 11 | 0.353 | 0.400 |
| Q | 44% | 0 | 1 | 6 | 12 | 0.333 | 0.400 |
| V | 43% | 0 | 1 | 6 | 13 | 0.316 | 0.400 |
| I | 38% | 0 | 1 | 6 | 14 | 0.300 | 0.400 |
| L | 35% | 0 | 1 | 6 | 15 | 0.286 | 0.400 |
| S | 23% | 0 | 1 | 6 | 16 | 0.273 | 0.400 |
| G | 28% | 1 | 0 | 7 | 16 | 0.304 | 0.467 |
| O | 14% | 0 | 1 | 7 | 17 | 0.292 | 0.467 |
For example, if we were to compute the precision and recall of the detection with id K (remember that the dataset has 15 ground truth objects, meaning TP+FN=15):
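Precision_K = Acc TP / (Acc TP + Acc FP) = 6 / (6 + 11) ≈ 0.353
Recall_K = Acc TP / (TP + FN) = 6 / 15 = 0.400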
The resulting PR curve is the one we've been showing previously. Here's the plot again for convenience:
AP - Average Precision
So far, we know how to create a PR curve per class. It's time to compute the average precision.
The AP is the area under the curve (AUC) of the PR curve.
However, as you might've noticed, the irregular shape and spikes of the PR curve can make this area pretty hard to compute. To simplify the calculation, the AP is computed over an interpolated version of the curve, sampled at a fixed set of recall points. COCO's official implementation samples 101 equally spaced recall points; the classic 11-point interpolation (popularized by PASCAL VOC) works exactly the same way and is easier to illustrate, so that's what we'll use here.
11-Point Interpolation
The 11-point interpolation for a given class C consists of three steps:
Define 11 equally spaced recall evaluation points: [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0].
For each recall evaluation point r, find the highest precision p among all recalls r' >= r. Or mathematically: p_interp(r) = max p(r') over all r' >= r.
Graphically, this is simply a way to remove the spikes from the graph, to look something like:
3. Average the interpolated precision at the 11 recall evaluation points to obtain the average precision for the class, AP_C: AP_C = (1/11) · Σ p_interp(r), where the sum runs over the 11 recall points.
Let's take the same graph as an example and compute the AP.
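Reading the interpolated precisions off the table from the previous section (the source of that graph): at recall 0.0 the best precision is 1.000, at 0.1 it is 0.667, and at 0.2, 0.3 and 0.4 it is 0.429; no detection reaches a recall of 0.5, so the remaining six points contribute 0. Averaging the 11 values:

AP = (1.000 + 0.667 + 0.429 + 0.429 + 0.429 + 0 + 0 + 0 + 0 + 0 + 0) / 11 ≈ 0.27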
COCO mAP - Mean Average Precision
We are finally ready to compute the mAP, the metric this post is all about. In the last section we computed the AP for a given class. The mAP is nothing more than the mean of the AP over all classes. In other words:
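mAP = (1 / #C) · Σ_C AP_C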
Here, #C is the number of classes. So, in a sense, the mAP is an average of an average of the precision. Now let's get into the specifics of the report COCO generates.
COCO does not make a distinction between AP and mAP, but conceptually they are referring to mAP.
mAP for IoU Thresholds
The first part of the report looks like:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.309
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.519
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.327
Let's focus on the IoU for now.
The IoU@0.50 and IoU@0.75 rows are straightforward. As explained in the IoU section, these are just the system evaluated at two different IoU thresholds: 50% and 75%, respectively. Of course, you'd expect the score at 0.75 to be lower than at 0.5, because the higher threshold is more penalizing: it requires a better bounding box match.
As we've seen, IoU@[0.50:0.95] expands to 10 different IoU thresholds: 0.5, 0.55, 0.6, ..., 0.95. From there, we compute the mAP at each threshold and average them. This is typically the most penalizing metric, and hence the default.
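mAP@[0.50:0.95] = (mAP@0.50 + mAP@0.55 + ... + mAP@0.95) / 10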
Funnily enough, this becomes an average of the average of the average!
mAP for Object Sizes
The next part of the report reads like:
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.173
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.462
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.547
Let's focus on the area portion now. These serve as filters to be able to measure the performance of the detector on objects of different size ranges.
The small objects have an area of [0², 32²[
The medium objects have an area of [32², 96²[
The large objects have an area of [96², ∞[
The area is measured in pixels and ∞ is, for practical reasons, defined as 1e5². So the report for area=small will present results only taking into account objects with areas within the specified range.
AR - Average Recall
Finally, besides mAP (or AP), COCO also presents us with AR. As you may imagine, the calculation is analogous, except everything is computed from the recall perspective. Apart from the regular report lines, it's worth highlighting these:
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.297
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.456
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.511
As you can see, here they vary the maxDets parameter, which controls the maximum number of detections per image used to perform the precision and recall calculations. A maximum of 100 detections may sound like a lot, but remember that in the PR curve you are varying the confidence threshold from 0 to 1, and at the lowest values you may get a lot of false positives.
It's worth mentioning that the maxDets sweep is only interesting from the recall perspective. This is because precision measures the accuracy of the detections that were made, while recall measures the model's ability to find all relevant instances in the dataset, which is directly affected by the number of allowed detections.
COCO Evaluation API
By now, hopefully it's clear how the different flavors of the mAP are computed and, more importantly, what they mean. COCO provides a Python module (pycocotools) that computes all these metrics under the hood and is, in fact, the official way of presenting results for their competition tasks.
This example shows how to use the COCO eval API:
# To download a sample dataset
import requests
import zipfile
from tqdm import tqdm

# An example model
from ultralytics import YOLO

# COCO tools
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# To check if we need to re-download
import os


# Helper to download files
def download_file(url, file_path):
    response = requests.get(url, stream=True)
    total_size_in_bytes = int(response.headers.get('content-length', 0))
    block_size = 1024  # 1 Kibibyte
    progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)
    with open(file_path, 'wb') as file:
        for data in response.iter_content(block_size):
            progress_bar.update(len(data))
            file.write(data)
    progress_bar.close()


coco_url = "https://github.com/ultralytics/yolov5/releases/download/v1.0/coco2017val.zip"
coco_zip_path = 'coco2017val.zip'

# Only download and unzip if the coco directory is not there
if not os.path.exists('coco'):
    download_file(coco_url, coco_zip_path)

    # Unzip the file
    with zipfile.ZipFile(coco_zip_path, 'r') as zip_ref:
        zip_ref.extractall('.')

# Load pre-trained YOLOv8 model
model = YOLO('./yolov8n.pt')

# Load COCO validation images annotations
coco_annotations_path = 'coco/annotations/instances_val2017.json'
coco = COCO(coco_annotations_path)

# Get image IDs
image_ids = coco.getImgIds()
images = coco.loadImgs(image_ids)

# Process images and collect detections
results = []
for img in images:
    image_path = f"coco/images/val2017/{img['file_name']}"

    # Run inference and move the results to numpy
    preds = model(image_path)[0].numpy().boxes

    # Convert results to a COCO compatible format
    for xyxy, conf, cls in zip(preds.xyxy, preds.conf, preds.cls):
        result = {
            'image_id': img['id'],
            # Note: this simple +1 offset is an approximation; COCO category
            # ids are not contiguous, so a stricter mapping would index
            # sorted(coco.getCatIds()) with the model's class index.
            'category_id': int(cls.item() + 1),
            # COCO expects boxes in [x, y, width, height] format
            'bbox': [xyxy[0].item(), xyxy[1].item(),
                     xyxy[2].item() - xyxy[0].item(),
                     xyxy[3].item() - xyxy[1].item()],
            'score': conf.item()
        }
        results.append(result)

# Convert results to COCO object
coco_dt = coco.loadRes(results)

# Run COCO evaluation
coco_eval = COCOeval(coco, coco_dt, 'bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
Create a virtual environment, install the dependencies and run the report:
pip3 install ultralytics
pip3 install requests
pip3 install pycocotools
python3 ./coco_map_report.py
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.053
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.071
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.058
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.021
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.052
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.083
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.043
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.061
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.061
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.027
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.062
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.094
Key Takeaways
The IoU serves as a metric to measure how good a bounding box prediction is.
The IoU can serve as a threshold to discard or accept predictions.
Based on the IoU, object detectors have per-class definitions of TP, FP and FN (True Negatives are not typically used).
Using these definitions and the IoU, we can compute the precision, recall and F1 score of a detector, per class.
Precision measures, for all the predictions made for a class, how many were actually correct.
Recall measures, for all the objects of a given class, the percentage that were actually predicted.
Measuring precision and recall on their own can be very misleading.
F1 Score measures a balance between precision and recall.
The shape and area under the curve of the PR-curve gives an indication of the performance of the detector, for a given class.
Interpolating the PR-curve gets rid of the curve spikes.
The AP evaluates the interpolated PR curve at equally spaced recall points (11 in the classic formulation, 101 in COCO's implementation).
mAP is the average of the AP of all classes.
mAP gives a measurement of the performance of the detector under different scenarios.
COCOeval is an API to compute the COCO mAP from your Python scripts.