## Setup
Install `ultralytics` and its [dependencies](https://github.com/ultralytics/ultralytics/blob/main/pyproject.toml) with pip, then check the software and hardware.

In [None]:
%pip install ultralytics[explorer] openai
import ultralytics
ultralytics.checks()

In [None]:
from ultralytics import Explorer

## Fine-Tuning with Custom Data

In [None]:
from ultralytics import YOLO

# Load your model
model = YOLO('yolov8n.pt')  # load a pretrained model (recommended for training)

# Train the model
results = model.train(data='path/to/your/data.yaml', epochs=100, imgsz=640, batch=32, save=True, exist_ok=True)

## 1. Similarity search
Use vector similarity search to find the data points in your dataset that are most similar to a query, along with their distance in the embedding space. First, create an embeddings table for the given dataset-model pair. The table only needs to be built once and is reused automatically.


In [None]:
exp = Explorer("path/to/your.yaml", model="path/to/your/model.pt")
exp.create_embeddings_table()

Once the embeddings table is built, you can run semantic search in either of the following ways:
- On a given index or list of indices in the dataset: `exp.get_similar(idx=[1,10], limit=10)`
- On any image or list of images not in the dataset: `exp.get_similar(img=["path/to/img1", "path/to/img2"], limit=10)`

When multiple inputs are given, the aggregate of their embeddings is used.
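
To make the aggregation step concrete, here is a minimal, library-free sketch. It assumes the aggregate is a plain element-wise mean and that similarity is measured by Euclidean distance; the toy table, the image names, and the helper names `mean_embedding` and `nearest` are all hypothetical and not part of the Explorer API.

```python
import math

def mean_embedding(vectors):
    """Element-wise mean of several embedding vectors (assumed aggregate)."""
    return [sum(components) / len(vectors) for components in zip(*vectors)]

def nearest(query, table, limit):
    """Return the `limit` entries of `table` closest to `query` as (name, distance)."""
    scored = [(math.dist(query, emb), name) for name, emb in table.items()]
    return [(name, dist) for dist, name in sorted(scored)[:limit]]

# Toy 2-D "embeddings"; real embeddings have many more dimensions.
table = {
    "img_a": [0.0, 0.0],
    "img_b": [1.0, 0.0],
    "img_c": [4.0, 4.0],
    "img_d": [0.5, 0.1],
}

# Query with two inputs: their embeddings are averaged into one query vector.
query = mean_embedding([table["img_a"], table["img_b"]])
results = nearest(query, table, limit=2)
print(results)
```

In this toy setup the averaged query lands midway between `img_a` and `img_b`, so the closest entry is `img_d` rather than either input.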

You get a pandas dataframe containing the `limit` most similar data points to the input, along with their distance in the embedding space. You can use this dataframe to perform further filtering.
<img width="1120" alt="Screenshot 2024-01-06 at 9 45 42 PM" src="https://learnopencv.com/wp-content/uploads/2024/03/image4.png">


In [None]:
similar = exp.get_similar(idx=1, limit=10)
similar.head()

You can also plot the similar samples directly using the `plot_similar` util.
<p>

 <img src="https://learnopencv.com/wp-content/uploads/2024/03/image9.png" />
</p>


### Plot list of idxs or imgs

In [None]:
exp.plot_similar(idx=6500, limit=20)
#exp.plot_similar(idx=[100,101], limit=10)

### Plot any external images

In [None]:
exp.plot_similar(img="https://learnopencv.com/wp-content/uploads/2024/03/rhino-test.jpg", limit=10, labels=False) 

## 2. Ask AI: Search or filter with Natural Language
You can prompt the Explorer object with the kind of data points you want to see, and it will try to return a dataframe containing them. Because it is powered by LLMs, it doesn't always get it right. When it fails, it returns `None`.
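
Since a failed prompt yields `None` instead of a dataframe, calling `.head()` on the result directly can raise an `AttributeError`. A minimal sketch of a guard follows; `preview` is a hypothetical helper name, not part of the Explorer API, and it only assumes the result is either `None` or an object with a pandas-style `.head(n)` method.

```python
def preview(result, n=5):
    """Return the first n rows of a query result, or None if the query failed.

    `result` is assumed to be either None or an object exposing a
    pandas-style `.head(n)` method (as a dataframe does).
    """
    if result is None:
        print("No result: the prompt could not be answered. Try rephrasing it.")
        return None
    return result.head(n)
```

With the `exp` object created earlier, this could be used as `preview(exp.ask_ai("show me images containing more than 10 objects with at least 2 persons"))`.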

In [None]:
df = exp.ask_ai("show me images containing more than 10 objects with at least 2 persons")
df.head(5)

To plot these results, you can use the `plot_query_result` util.
Example:
```
plt = plot_query_result(exp.ask_ai("show me 10 images containing exactly 2 persons"))
Image.fromarray(plt)
```
<p>
    <img src="https://github.com/AyushExel/assets/assets/15766192/2cb780de-d05b-4412-a526-7f7f0f10e669">

</p>

### Plot AI Results

In [None]:
from ultralytics.data.explorer import plot_query_result
from PIL import Image

plt = plot_query_result(exp.ask_ai("show me 10 images containing exactly 2 persons"))
Image.fromarray(plt)

## 3. Run SQL queries on your Dataset!
Sometimes you might want to investigate a certain type of entry in your dataset. For this, Explorer allows you to execute SQL queries.
It accepts either of these formats:
- Queries beginning with "WHERE" automatically select all columns. This can be thought of as a shorthand query.
- Full queries, in which you specify which columns to select.
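
The two query formats can be illustrated with a small sketch that uses Python's built-in `sqlite3` as a stand-in for whatever engine Explorer uses internally. Everything here is an assumption made for illustration: the table name `images`, the `expand_query` helper, the toy rows, and storing labels as a comma-separated string.

```python
import sqlite3

def expand_query(query, table="images"):
    """Turn a shorthand query starting with WHERE into a full SELECT * query."""
    stripped = query.strip()
    if stripped.upper().startswith("WHERE"):
        return f"SELECT * FROM {table} {stripped}"
    return stripped  # already a full query; leave it untouched

# Toy in-memory table standing in for the dataset table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (im_file TEXT, labels TEXT)")
conn.executemany(
    "INSERT INTO images VALUES (?, ?)",
    [
        ("a.jpg", "person, person, dog"),
        ("b.jpg", "person, cat"),
        ("c.jpg", "dog"),
    ],
)

# Shorthand form: all columns are selected automatically.
shorthand_rows = conn.execute(
    expand_query("WHERE labels LIKE '%person, person%' AND labels LIKE '%dog%'")
).fetchall()

# Full form: the query names the columns it wants.
full_rows = conn.execute(
    expand_query("SELECT im_file FROM images WHERE labels LIKE '%dog%' ORDER BY im_file")
).fetchall()

print(shorthand_rows)
print(full_rows)
```

The shorthand query returns whole rows, while the full query returns only the `im_file` column.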

This can be used to investigate model performance and specific data points. For example:
- Say your model struggles on images that contain humans and dogs. You can write a query to select the data points that have at least 2 humans AND at least one dog.

You can combine SQL queries and semantic search to filter down to a specific type of result.
<img width="994" alt="Screenshot 2024-01-06 at 9 47 30 PM" src="https://learnopencv.com/wp-content/uploads/2024/03/image15.png">


### Plot SQL Queries
Just like similarity search, you also get a util to plot SQL query results directly: `exp.plot_sql_query`.

In [None]:
table = exp.sql_query("WHERE labels LIKE '%rhino%' AND labels LIKE '%elephant%' LIMIT 10")
table

In [None]:
exp.plot_sql_query("WHERE labels LIKE '%person, person%' AND labels LIKE '%dog%' LIMIT 10", labels=True)

## 4. Similarity Index
Here's a simple example of an operation powered by the embeddings table. Explorer comes with a `similarity_index` operation:
* It estimates how similar each data point is to the rest of the dataset.
* It does that by counting how many image embeddings lie closer than `max_dist` to the current image in the generated embedding space, considering the `top_k` most similar images at a time.
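
The counting idea can be sketched in plain Python. This is an illustration only, not the library's implementation. It assumes Euclidean distance, treats `top_k` as a fraction of the dataset size (suggested by the value `0.01` used in this notebook), and excludes each image from its own neighbor list, which the library may or may not do. The `similarity_counts` helper and the toy embeddings are hypothetical.

```python
import math

def similarity_counts(embeddings, max_dist, top_k):
    """For each embedding, count how many of its nearest neighbors lie within max_dist.

    Only the k nearest neighbors are inspected, where k = top_k * dataset size
    (assumption: top_k is a fraction). The point itself is excluded.
    """
    n = len(embeddings)
    k = max(1, int(top_k * n))
    counts = []
    for i, query in enumerate(embeddings):
        # Distances from this point to every other point, nearest first.
        dists = sorted(
            math.dist(query, other)
            for j, other in enumerate(embeddings)
            if j != i
        )
        counts.append(sum(1 for d in dists[:k] if d <= max_dist))
    return counts

# Three tightly clustered points and one outlier.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
counts = similarity_counts(toy, max_dist=0.2, top_k=0.5)
print(counts)
```

Points in the tight cluster receive a high count, while the outlier receives a count of zero, which is the signal the similarity index is meant to expose.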

For a given dataset, model, `max_dist`, and `top_k`, the similarity index is generated once and then reused. If your dataset has changed, or you simply need to regenerate the similarity index, you can pass `force=True`.
Similar to vector and SQL search, this also comes with a util to plot it directly. Let's look at the plot first.
<img width="633" alt="Screenshot 2024-01-06 at 9 49 36 PM" src="https://github.com/AyushExel/assets/assets/15766192/96a9d984-4a72-4784-ace1-428676ee2bdd">



In [None]:
exp.plot_similarity_index(max_dist=0.2, top_k=0.01)

Now let's look at the output of the operation.

In [None]:
# Compute (or reuse) the similarity index; pass force=True to regenerate it
sim_idx = exp.similarity_index(max_dist=0.2, top_k=0.01, force=False)

In [None]:
sim_idx

Let's create a query to see which data points have a similarity count of more than 30, and plot images similar to them.

In [None]:
import numpy as np

sim_count = np.array(sim_idx["count"])
sim_idx['im_file'][sim_count > 30]

You should see something like this:

<img src="https://learnopencv.com/wp-content/uploads/2024/03/image1.png">


### Using avg embeddings of 2 images

In [None]:
exp.plot_similar(idx=[7146, 14035]) 