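Have you ever heard something brilliant at a Meetup, only to find later that you cannot recall the details? As a developer advocate who actively organizes and attends Meetups, I run into this often.
To solve this problem, I started exploring similarity search as a way to sift through large amounts of unstructured data. Unstructured data is often said to account for about 80% of the world's data, and it can be converted into vectors with various ML models. In this article, my tool of choice is Milvus, a popular open-source vector database that excels at managing and searching complex data. Milvus helps us uncover hidden relationships in the data and compare how similar data points are.
This article uses SentenceTransformers to convert unstructured data into embedding vectors. SentenceTransformers is a Python framework for turning sentences, text, and images into embeddings, and it can encode sentences or text in more than 100 languages. We can then compare these embeddings with a similarity metric (for example, cosine similarity) to find sentences with similar meaning.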
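To make the idea of comparing embeddings concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors and their names are hypothetical stand-ins; real embeddings from a model have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", for illustration only
talk_a = [0.9, 0.1, 0.0]
talk_b = [0.8, 0.2, 0.1]
talk_c = [0.0, 0.1, 0.9]

sim_ab = cosine_similarity(talk_a, talk_b)  # vectors pointing in similar directions
sim_ac = cosine_similarity(talk_a, talk_c)  # vectors pointing in different directions
```

Vectors that point in similar directions score close to 1, while unrelated ones score close to 0, which is the property the similarity search relies on.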
01.
Downloading the Data
Meetup.com does not offer a free public API; you need a Pro subscription to use its API.
For this article, I generated some Meetup data myself. You can get it from GitHub (https://github.com/stephen37/similarity_search_mlops/blob/abc1d91878320911f069fcb8a2949b0d7d592370/data/data_meetup.csv) and load it with Pandas.
import pandas as pd
df = pd.read_csv('data/data_meetup.csv')
02.
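Tech Stack: Milvus and SentenceTransformers
We will use Milvus as the vector database, SentenceTransformers to generate text embeddings, and OpenAI GPT-3.5-turbo to summarize the Meetup content. Meetup descriptions tend to be long, so we summarize them to keep the output manageable.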
2.1 Milvus Lite
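Milvus offers several deployment options to suit different needs. For lightweight applications that need a quick setup, Milvus Lite is the ideal choice. It can be installed easily with pip install pymilvus and runs directly in a Jupyter notebook.
2.2 Milvus with Docker/Docker Compose
For applications that prioritize stability, you can run Milvus with Docker Compose (the linked guide covers the standalone deployment). The Docker Compose file is available in the documentation (https://milvus.io/docs/install_standalone-docker-compose.md) and on the GitHub page (https://github.com/milvus-io/milvus). When you start Milvus with Docker Compose, you will see three containers, and you can connect to Milvus on the default port 19530.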
2.3 SentenceTransformers
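SentenceTransformers is used to create the embedding vectors. It can be installed from PyPI with pip install sentence-transformers. We will use the all-MiniLM-L6-v2 model because it is about 5 times smaller and 5 times faster than the best model SentenceTransformers offers, while still delivering comparable quality.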
03.
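Running a Similarity Search
3.1 Starting Milvus
To run a similarity search, we need a vector database. Milvus can be started quickly with Docker (https://milvus.io/docs/install_standalone-docker.md).
3.2 Importing Data into Milvus
Before importing data, we need to create a Collection and define its schema. We start by setting up the field schemas, the collection schema, and the collection name.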
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

# Connect to the running Milvus instance (assumes localhost and the default port 19530)
connections.connect(host="localhost", port="19530")

# Each record is stored as: title, date, content, content embedding
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="date", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=10000),
    # dim must match the embedding size of the model (384 for all-MiniLM-L6-v2)
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name="mlops_meetups", schema=schema)
With the Collection and schema created, let's build an index on the embedding field and then call load() to load the collection into memory.
collection.create_index(field_name="embedding")
collection.load()
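The call above relies on the default index settings. If you prefer to spell them out, the sketch below shows one way to do it as an alternative to the default call. The index type, the nlist value, and the build_index helper are assumptions chosen for illustration, not settings taken from the original setup. The metric is set to IP because, as far as I know, Milvus expects the metric used at search time to match the one the index was built with.

```python
# Assumed index settings for illustration; the article itself uses the defaults.
index_params = {
    "index_type": "IVF_FLAT",  # a common general-purpose index type in Milvus
    "metric_type": "IP",       # inner product, the metric used in the search later on
    "params": {"nlist": 128},  # number of clusters; tune for your data size
}

def build_index(collection, field_name="embedding"):
    """Create the index on an existing pymilvus Collection, then load it.

    build_index is a hypothetical helper, not part of pymilvus.
    """
    collection.create_index(field_name=field_name, index_params=index_params)
    collection.load()
```

You would call build_index(collection) in place of the two default calls, not in addition to them.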
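3.3 Generating Embeddings with SentenceTransformer
As mentioned earlier, we will use SentenceTransformer with the all-MiniLM-L6-v2 model to generate the embeddings. First, we import the library and load the model; then we encode the content and insert it into the Collection.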
from sentence_transformers import SentenceTransformer
transformer = SentenceTransformer('all-MiniLM-L6-v2')
content_detail = df['content']
content_detail = content_detail.tolist()
embeddings = [transformer.encode(c) for c in content_detail]
# Create an embedding column in our Dataframe
df['embedding'] = embeddings
# Insert the data in the collection
collection.insert(data=df)
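3.4 Summarizing the Meetup Content
Meetup descriptions are rich in detail and often include the agenda, event sponsors, and venue- or event-specific rules. This information matters if you are attending, but it is not relevant to our use case here. We will use OpenAI GPT-3.5-turbo to summarize the Meetup content.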
import openai  # The client reads the API key from the OPENAI_API_KEY environment variable

def summarise_meetup_content(content: str) -> str:
    """Summarize a Meetup description with GPT-3.5-turbo."""
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "Summarize content you are provided with."
            },
            {
                "role": "user",
                "content": f"{content}"
            }
        ],
        temperature=0,  # Keep the summaries as deterministic as possible
        max_tokens=1024,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    summary = response.choices[0].message.content
    return summary
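Every call to the summarizer is a remote API request, so summarizing the same Meetup again for a later query costs time and tokens. One possible way to avoid that is to cache summaries by content. The sketch below is an optional addition, not part of the original code: make_cached_summarizer is a hypothetical helper, and fake_summarize is a stand-in so the example runs without an API key.

```python
from functools import lru_cache

def make_cached_summarizer(summarize, maxsize=128):
    """Wrap a str -> str summarizer so repeated inputs skip the underlying call."""
    @lru_cache(maxsize=maxsize)
    def cached(content: str) -> str:
        return summarize(content)
    return cached

# Stand-in summarizer for demonstration only; in the article this role is
# played by summarise_meetup_content, which calls the OpenAI API.
calls = []

def fake_summarize(content: str) -> str:
    calls.append(content)  # record each real invocation
    return content[:20]

cached_summarize = make_cached_summarizer(fake_summarize)
first = cached_summarize("A long Meetup description about MLOps in Berlin")
second = cached_summarize("A long Meetup description about MLOps in Berlin")
```

In the article's setting you would pass summarise_meetup_content instead of the stand-in. The cache lives in memory only, so it is reset whenever the notebook kernel restarts.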
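3.5 Returning Similar Content
Before running the similarity search, we need to make sure the vector database can understand our query. To do that, we create an embedding vector for the query.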
search_terms = "The speaker speaks about Open Source and ML Platform"
search_data = [transformer.encode(search_terms)]  # Must be a list.
3.5.1 Searching the Milvus Collection for Similar Content
res = collection.search(
    data=search_data,  # Embedded search value
    anns_field="embedding",  # Search across embeddings
    param={"metric_type": "IP"},  # Rank by inner product
    limit=3,  # Limit to top_k results per search
    output_fields=["title", "content"]  # Include the title and content fields in the result
)
for hits in res:
    print("Search Terms:", search_terms)
    print("Results:")
    for hit in hits:
        content = hit.entity.get("content")
        print(hit.entity.get("title"), "----", hit.distance)
        print(f"{summarise_meetup_content(content)} \n")
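Conceptually, a search with the IP metric ranks the stored vectors by their inner product with the query vector. The brute-force sketch below illustrates that ranking on hypothetical three-dimensional vectors; it is not how Milvus works internally, since a vector database uses an index to avoid scanning every vector. The titles, vectors, and helper names are made up for illustration. When vectors are normalized to unit length, as they are here, the inner product equals the cosine similarity.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query, items, k=3):
    """Return the k (title, score) pairs with the highest inner product."""
    scored = [(title, inner_product(query, vec)) for title, vec in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical titles and toy vectors, normalized to unit length
items = [
    ("Open-source ML platforms", normalize([0.9, 0.2, 0.1])),
    ("ML monitoring", normalize([0.3, 0.9, 0.1])),
    ("Carbon footprint of ML", normalize([0.1, 0.2, 0.9])),
]
query = normalize([1.0, 0.3, 0.0])
results = top_k(query, items, k=2)
```

Higher scores mean closer matches, which is why hit.distance is largest for the first result when the metric is IP.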
3.5.2 Results
Search Terms: The speaker speaks about Open Source and ML Platform
Results:
First MLOps.community Berlin Meetup ---- 0.5537542700767517
The MLOps.community meetup in Berlin on June 30th will feature a main talk by Stephen Batifol from Wolt on Scaling Open-Source Machine Learning. The event will also include lightning talks, networking, and food and drinks. The agenda includes opening doors at 6:00 pm, Stephen's talk at 7:00 pm, lightning talks at 7:50 pm, and socializing at 8:15 pm. Attendees can sign up for lightning talks on Meetup.com. The event is in collaboration with neptune.ai.
MLOps.community Berlin 04: Pre-event Women+ In Data and AI Festival ---- 0.4623506963253021
The MLOps.community Berlin is hosting a special edition event on June 29th and 30th at Thoughtworks. The event is a warm-up for the Women+ In Data and AI festival. The meetup will feature speakers Fiona Coath discussing surveillance capitalism and Magdalena Stenius talking about the carbon footprint of machine learning. The agenda includes talks, lightning talks, and networking opportunities. Attendees are encouraged to review and abide by the event's Code of Conduct for an inclusive and respectful environment.
MLOps.community Berlin Meetup 02 ---- 0.41342616081237793
The MLOps.community meetup in Berlin on October 6th will feature a main talk by Lina Weichbrodt on ML Monitoring, lightning talks, and networking opportunities. The event will be held at Wolt's office with a capacity limit of 150 people. Lina has extensive experience in developing scalable machine learning models and has worked at companies like Zalando and DKB. The agenda includes food, a bonding activity, the main talk, lightning talks, and socializing. Attendees can also sign up to give lightning talks on various MLOps-related topics. The event is in collaboration with neptune.ai.
Author: Stephen Batifol, Developer Advocate at Zilliz
Original link: https://blog.csdn.net/weixin_44839084/article/details/143883163