faiss中的索引基于几个基础算法构建,只不过在faiss中是一种高效的实现。分别是
- k-means聚类
- PCA降维
- PQ编码
- 解码
关于k-means聚类算法在之前的博客《OpenCV中的Kmeans聚类》中有介绍过,是一种无监督算法,自动的基于设定的聚类中心数进行聚类。
# 导入faiss
import sys
import faiss
import numpy as np
# 构建数据
d = 512 # 维数
n_data = 2000
np.random.seed(0)
data = []
mu = 3
sigma = 0.1
for i in range(n_data):
data.append(np.random.normal(mu, sigma, d))
data = np.array(data).astype('float32')
# 聚类
ncentroids = 100 # 设定聚类中心100个
niter = 20
verbose = True
d = data.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(data)
# 输出聚类中心
print(kmeans.centroids)
print(kmeans.centroids.shape) # 聚类中心维度
[[3.0195723 3.0102541 3.0154822 ... 3.0012543 2.983653 2.9833684]
[2.963544 2.9899907 2.9942687 ... 2.989136 2.9432213 2.9964879]
[3.038704 2.9829412 3.0110557 ... 2.9955654 3.0334764 3.0621648]
...
[3.1335988 3.0283628 3.0185556 ... 2.9817123 2.8893135 2.846763 ]
[3.0277164 2.9488263 2.9712245 ... 3.046851 3.0523338 2.9560134]
[2.9731872 2.9688954 3.0551343 ... 3.0077536 3.038112 3.0422602]]
(100, 512)
结果保存在kmeans对象中:
kmeans.centroids
:聚类中心kmeans.obj
:迭代时候目标函数的值(总方差)
更多参数选择:
-
nredo
:运行此次数的群集,并保持最佳质心(根据聚类目标选择) -
verbose
:使聚类更详细 -
spherical
:执行球形K-means,每次迭代后,质心为L2标准化 -
INT_CORDROID
:圆形心质坐标取整 -
update_index
:每次迭代后重新训练索引? -
min_points_per_centroid / max_points_per_centroid
:以上训练集是分支的 -
seed
:随机数发生器的种子
计算某个向量属于哪一个子类,返回聚类中心次序和L2距离
D, I = kmeans.index.search(data[:5], 1)
print(D)
print(I)
[[5.027096 ]
[4.712036 ]
[4.7764983]
[4.5557623]
[4.403679 ]]
[[61]
[44]
[61]
[49]
[83]]
# 返回距离每个聚类中心最近的5个向量
index = faiss.IndexFlatL2 (d)
index.add(data)
D, I = index.search (kmeans.centroids, 5)
print(D)
print(I)
[[3.8203125e+00 4.3466797e+00 4.3652344e+00 4.3691406e+00 4.3789062e+00]
[3.6269531e+00 4.0800781e+00 4.1953125e+00 4.2304688e+00 4.2705078e+00]
[3.7167969e+00 4.0781250e+00 4.1015625e+00 4.1796875e+00 4.2685547e+00]
[3.4667969e+00 3.8759766e+00 4.1923828e+00 4.4365234e+00 4.5742188e+00]
[0.0000000e+00 9.0585938e+00 9.0957031e+00 9.1376953e+00 9.1484375e+00]
[3.7412109e+00 4.3515625e+00 4.4199219e+00 4.5517578e+00 4.5849609e+00]
[3.8154297e+00 4.2207031e+00 4.4433594e+00 4.5625000e+00 4.5986328e+00]
[3.5927734e+00 3.9169922e+00 4.3535156e+00 4.4375000e+00 4.4453125e+00]
[3.9101562e+00 4.3496094e+00 4.3554688e+00 4.4189453e+00 4.4384766e+00]
[0.0000000e+00 8.6308594e+00 8.8496094e+00 8.9980469e+00 9.0107422e+00]
[3.8427734e+00 4.2568359e+00 4.3339844e+00 4.3779297e+00 4.3789062e+00]
[3.5175781e+00 4.0654297e+00 4.1015625e+00 4.2148438e+00 4.2558594e+00]
[2.3457031e+00 2.3486328e+00 6.8046875e+00 6.8427734e+00 6.9658203e+00]
[9.7656250e-04 8.8378906e+00 8.9287109e+00 8.9941406e+00 9.0283203e+00]
[3.7519531e+00 4.3740234e+00 4.5185547e+00 4.5371094e+00 4.6191406e+00]
[0.0000000e+00 8.5371094e+00 8.6289062e+00 8.6933594e+00 8.9248047e+00]
[3.5009766e+00 4.2011719e+00 4.2431641e+00 4.3730469e+00 4.4199219e+00]
[3.3779297e+00 3.6240234e+00 3.6523438e+00 3.7822266e+00 3.9042969e+00]
[3.7656250e+00 4.2851562e+00 4.3125000e+00 4.5205078e+00 4.5849609e+00]
[3.7421875e+00 4.2714844e+00 4.4677734e+00 4.5693359e+00 4.5761719e+00]
[3.8076172e+00 4.3037109e+00 4.3144531e+00 4.3457031e+00 4.3574219e+00]
[0.0000000e+00 8.8740234e+00 8.9199219e+00 8.9599609e+00 8.9638672e+00]
[3.3349609e+00 3.7226562e+00 3.7910156e+00 4.0488281e+00 4.3642578e+00]
[2.9335938e+00 3.1669922e+00 3.4218750e+00 5.9628906e+00 6.0009766e+00]
[2.2109375e+00 2.2128906e+00 6.5175781e+00 6.7275391e+00 6.8320312e+00]
[2.0791016e+00 2.0800781e+00 6.6953125e+00 6.7089844e+00 6.7255859e+00]
[0.0000000e+00 8.6171875e+00 8.6777344e+00 9.0263672e+00 9.0664062e+00]
[9.7656250e-04 8.8359375e+00 8.8808594e+00 8.9208984e+00 8.9423828e+00]
[2.4062500e+00 2.4082031e+00 6.8105469e+00 6.9472656e+00 7.0068359e+00]
[9.7656250e-04 8.6279297e+00 8.7968750e+00 8.9531250e+00 8.9550781e+00]
[0.0000000e+00 8.9794922e+00 9.0263672e+00 9.0439453e+00 9.0703125e+00]
[3.2343750e+00 3.5175781e+00 3.5957031e+00 3.9169922e+00 5.3691406e+00]
[2.2929688e+00 2.2949219e+00 6.6669922e+00 6.6992188e+00 6.7089844e+00]
[3.8671875e+00 4.3222656e+00 4.3359375e+00 4.3564453e+00 4.4472656e+00]
[3.7900391e+00 4.3183594e+00 4.3662109e+00 4.3964844e+00 4.5839844e+00]
[3.7949219e+00 4.3251953e+00 4.3828125e+00 4.3964844e+00 4.4072266e+00]
[0.0000000e+00 8.5292969e+00 8.8564453e+00 8.8642578e+00 8.8750000e+00]
[2.8613281e+00 3.0986328e+00 3.3378906e+00 6.0507812e+00 6.0644531e+00]
[2.9062500e+00 2.9882812e+00 3.3222656e+00 5.8398438e+00 6.0195312e+00]
[3.7910156e+00 4.1953125e+00 4.2392578e+00 4.3154297e+00 4.3330078e+00]
[3.8828125e+00 4.2753906e+00 4.3330078e+00 4.4433594e+00 4.4492188e+00]
[3.4697266e+00 4.0683594e+00 4.3691406e+00 4.4130859e+00 4.4150391e+00]
[3.8027344e+00 4.3623047e+00 4.4492188e+00 4.4511719e+00 4.5117188e+00]
[3.8144531e+00 4.4375000e+00 4.5009766e+00 4.5097656e+00 4.5244141e+00]
[3.7324219e+00 4.3505859e+00 4.3671875e+00 4.3886719e+00 4.4052734e+00]
[3.5830078e+00 4.1464844e+00 4.2402344e+00 4.2773438e+00 4.4667969e+00]
[3.3808594e+00 3.9414062e+00 4.0000000e+00 4.1132812e+00 4.1464844e+00]
[2.9091797e+00 3.1630859e+00 3.4140625e+00 5.7685547e+00 5.8164062e+00]
[0.0000000e+00 9.0380859e+00 9.0937500e+00 9.0957031e+00 9.1191406e+00]
[3.6074219e+00 4.2148438e+00 4.2587891e+00 4.3574219e+00 4.3632812e+00]
[9.7656250e-04 8.8525391e+00 8.9111328e+00 9.0576172e+00 9.1298828e+00]
[2.3740234e+00 2.3750000e+00 6.7568359e+00 6.9619141e+00 7.0683594e+00]
[2.2177734e+00 2.2197266e+00 6.6367188e+00 6.7744141e+00 6.8906250e+00]
[1.9531250e-03 8.6640625e+00 9.1230469e+00 9.1894531e+00 9.2167969e+00]
[0.0000000e+00 8.7958984e+00 9.0039062e+00 9.0644531e+00 9.0917969e+00]
[2.8320312e+00 3.1093750e+00 3.2480469e+00 5.9785156e+00 6.0566406e+00]
[3.8183594e+00 4.3847656e+00 4.4121094e+00 4.4179688e+00 4.4355469e+00]
[2.4335938e+00 2.4355469e+00 6.8212891e+00 6.8222656e+00 6.8789062e+00]
[0.0000000e+00 8.7382812e+00 8.8437500e+00 8.9365234e+00 9.0361328e+00]
[3.8437500e+00 4.3125000e+00 4.3320312e+00 4.3818359e+00 4.4121094e+00]
[3.3281250e+00 3.5781250e+00 3.9941406e+00 4.0322266e+00 4.4687500e+00]
[3.7285156e+00 4.1611328e+00 4.2333984e+00 4.2851562e+00 4.3378906e+00]
[3.7382812e+00 4.1406250e+00 4.3603516e+00 4.4218750e+00 4.4697266e+00]
[2.2978516e+00 2.2998047e+00 6.7763672e+00 6.7939453e+00 6.7998047e+00]
[3.7988281e+00 4.3564453e+00 4.4707031e+00 4.5048828e+00 4.5126953e+00]
[3.7900391e+00 4.3642578e+00 4.4472656e+00 4.4902344e+00 4.4931641e+00]
[2.1865234e+00 2.1894531e+00 6.5410156e+00 6.6201172e+00 6.7519531e+00]
[2.2285156e+00 2.2285156e+00 6.6171875e+00 6.6210938e+00 6.6757812e+00]
[3.8369141e+00 4.2773438e+00 4.3076172e+00 4.4609375e+00 4.4980469e+00]
[2.9296875e-03 8.6015625e+00 8.6787109e+00 8.8613281e+00 8.9052734e+00]
[3.8437500e+00 4.3750000e+00 4.4257812e+00 4.4609375e+00 4.4843750e+00]
[3.8623047e+00 4.2587891e+00 4.2812500e+00 4.3554688e+00 4.3710938e+00]
[3.6523438e+00 3.9287109e+00 3.9990234e+00 4.1103516e+00 4.1933594e+00]
[0.0000000e+00 8.9931641e+00 9.1396484e+00 9.1943359e+00 9.2109375e+00]
[2.2480469e+00 2.2480469e+00 6.8378906e+00 6.9013672e+00 6.9023438e+00]
[3.7441406e+00 4.4804688e+00 4.5458984e+00 4.5761719e+00 4.5996094e+00]
[3.6289062e+00 4.0683594e+00 4.3623047e+00 4.4296875e+00 4.4677734e+00]
[3.7363281e+00 4.3417969e+00 4.3515625e+00 4.4511719e+00 4.4960938e+00]
[0.0000000e+00 8.7128906e+00 8.9687500e+00 9.0869141e+00 9.1064453e+00]
[3.4726562e+00 3.6914062e+00 4.0546875e+00 4.0927734e+00 4.4355469e+00]
[3.5927734e+00 4.3212891e+00 4.3271484e+00 4.4785156e+00 4.5058594e+00]
[3.1914062e+00 3.4531250e+00 3.5703125e+00 3.5722656e+00 5.5263672e+00]
[3.7792969e+00 4.2929688e+00 4.3574219e+00 4.3710938e+00 4.3955078e+00]
[3.7207031e+00 4.0117188e+00 4.1591797e+00 4.4082031e+00 4.4091797e+00]
[2.3359375e+00 2.3359375e+00 6.6621094e+00 6.9716797e+00 7.1054688e+00]
[3.6054688e+00 4.1835938e+00 4.1933594e+00 4.3076172e+00 4.3876953e+00]
[3.0058594e+00 3.3808594e+00 3.4121094e+00 5.8144531e+00 5.9804688e+00]
[3.6806641e+00 4.0449219e+00 4.0625000e+00 4.2500000e+00 4.3759766e+00]
[2.2041016e+00 2.2070312e+00 6.7275391e+00 7.0244141e+00 7.0302734e+00]
[2.8291016e+00 3.2236328e+00 3.3925781e+00 5.7548828e+00 5.9121094e+00]
[3.8447266e+00 4.2128906e+00 4.3095703e+00 4.3681641e+00 4.3886719e+00]
[3.0585938e+00 3.7578125e+00 3.8242188e+00 3.8652344e+00 5.5468750e+00]
[3.5449219e+00 3.8789062e+00 4.0400391e+00 4.1953125e+00 4.2910156e+00]
[3.5341797e+00 4.1005859e+00 4.3300781e+00 4.3359375e+00 4.3554688e+00]
[3.6171875e+00 4.1240234e+00 4.1435547e+00 4.3378906e+00 4.3505859e+00]
[3.7167969e+00 4.3369141e+00 4.4003906e+00 4.4765625e+00 4.5156250e+00]
[3.7050781e+00 4.2529297e+00 4.2763672e+00 4.3886719e+00 4.4355469e+00]
[3.2031250e+00 3.4453125e+00 3.5498047e+00 3.7802734e+00 5.4550781e+00]
[3.3593750e+00 3.8710938e+00 4.1171875e+00 4.1445312e+00 4.5146484e+00]
[3.6142578e+00 3.8007812e+00 4.0078125e+00 4.2519531e+00 4.3789062e+00]]
[[1083 277 1624 1555 981]
[1411 198 620 414 1664]
[ 140 28 317 1915 1597]
[1704 1252 1745 1906 126]
[ 867 1055 308 1483 788]
[1882 1937 1568 981 1695]
[1885 1826 481 37 779]
[1725 1417 19 1356 1785]
[1801 981 1624 145 1524]
[1011 835 700 650 148]
[1496 650 681 1624 518]
[ 780 545 881 61 101]
[1428 1042 1489 389 511]
[ 454 259 329 1336 444]
[1273 1727 106 981 1438]
[1027 1686 106 121 835]
[1374 287 1825 637 35]
[1342 288 1146 1958 1348]
[ 528 1678 1167 592 145]
[ 440 981 337 402 1168]
[1174 773 1686 981 519]
[ 294 30 685 655 133]
[ 607 1512 623 1262 1960]
[1191 668 1140 1100 724]
[1854 1851 121 681 276]
[ 781 1927 704 610 1417]
[1708 896 1707 1172 1331]
[ 908 1558 1116 519 1686]
[ 88 1848 1624 277 773]
[ 111 981 1417 1666 464]
[1961 1116 1664 1185 228]
[ 368 1905 890 1642 604]
[1607 1376 163 48 1893]
[1653 981 427 1523 655]
[1087 1353 31 1120 832]
[1800 981 413 1624 106]
[1347 182 740 1987 1618]
[ 58 1404 105 981 604]
[1455 81 170 198 685]
[ 685 277 981 145 879]
[1850 1334 981 277 1886]
[1758 922 1600 1506 1192]
[ 269 981 1483 1162 277]
[1125 1624 106 1907 1401]
[1504 511 981 106 28]
[ 62 1377 222 1839 128]
[1205 423 939 862 1824]
[1124 216 1104 145 1353]
[ 435 586 1660 456 269]
[ 740 1181 1578 1743 155]
[1809 1571 608 497 1956]
[1050 1203 1624 1536 1666]
[1171 307 156 906 790]
[ 658 1241 1337 804 1540]
[ 425 629 1318 277 427]
[ 783 1449 419 121 650]
[ 66 981 1459 277 315]
[1856 1114 1055 940 1558]
[ 87 1374 916 1707 1057]
[ 712 981 182 277 106]
[1574 329 136 197 971]
[ 879 981 1624 277 106]
[ 568 389 1241 981 560]
[1528 1697 1008 725 1523]
[1255 835 1384 981 1115]
[1912 981 548 94 277]
[1004 1048 1558 574 182]
[1299 1418 1624 121 1470]
[1070 1384 981 145 1624]
[ 514 879 1949 1540 329]
[1876 981 277 1459 650]
[ 627 1008 981 963 529]
[1699 1132 1158 1431 1837]
[1843 902 402 1575 1304]
[ 340 1190 842 504 1558]
[ 453 457 648 1079 234]
[ 344 410 18 470 1877]
[ 124 277 981 1624 1555]
[ 275 1237 1381 1439 875]
[ 793 1249 142 575 664]
[ 711 952 1751 1229 619]
[1722 1141 1672 917 1654]
[1540 1707 277 879 103]
[1176 574 1459 4 256]
[ 584 1092 829 681 291]
[ 687 874 1644 646 535]
[1577 1715 1239 1623 259]
[1897 1130 1179 1553 1587]
[ 479 841 122 28 1561]
[1921 1343 109 518 1686]
[ 78 981 277 896 145]
[1736 229 1453 1157 542]
[1858 610 616 1144 9]
[ 741 1895 1400 131 836]
[1774 586 1470 56 172]
[ 249 1554 277 613 981]
[ 159 1970 255 605 471]
[1392 909 228 104 121]
[ 666 540 403 1804 372]
[ 244 99 1705 1331 231]]
mat = faiss.PCAMatrix (512, 64) # 从512维降为64维
mat.train(data)
print(data.shape)
assert mat.is_trained
tr = mat.apply_py(data)
print(tr.shape)
(2000, 512)
(2000, 64)
可以看到维度从512维度降低到64维度
d = 512 # 数据维度
cs = 4 # code size (bytes)
# 训练数据集
xt = data
# dataset to encode (could be same as train)
x = data
pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt) # 训练
# encode编码
codes = pq.compute_codes(x)
# decode解码
x2 = pq.decode(codes)
# 编码-解码后与原始数据的差
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
print(avg_relative_error)
0.0008764966
标量量化器(scalar quantizer)与之类似
d = 512 # 数据维度
# 训练集
xt = data
# dataset to encode (could be same as train)
x = data
# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)
# encode 编码
codes = sq.compute_codes(x)
# decode 解码
x2 = sq.decode(codes)
# 计算编码-解码后与原始数据的差
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
print(avg_relative_error)
6.7287445e-08