业务上碰到了需要对圆做聚类的场景
sklearn现在的聚类方法dbscan(支持自定义距离公式,但是性能影响巨大
kdtree(不支持自定义距离计算
google、百度无果
思考用rtree实现
参考:
https://zhuanlan.zhihu.com/p/62639268
计算过程
# 数据构建
num_points = 200000
centers = 200
center_box = (0, 300000)
random_state = 4
# 生成样本数据
rng = np.random.default_rng(random_state)
data = rng.uniform(low=center_box[0], high=center_box[1], size=(num_points, 2))
y1 = rng.integers(low=0, high=centers, size=num_points)
# 添加半径数据
r = rng.random(size=num_points) * 10
points = np.column_stack((data, r))
# 距离计算
def distance_with_radius(point1, point2):
# point1和point2分别为点的坐标和半径,例如 [x1, y1, r1], [x2, y2, r2]
x1, y1, r1 = point1
x2, y2, r2 = point2
dist = np.sqrt((x1 - x2)**2 + (y1 - y2)**2) # 计算欧氏距离
return dist - r1 - r2 # 减去半径之和
# 建树
p = index.Property()
p.dimension = 2
idx = index.Index(properties=p)
for i, (x, y, r) in enumerate(points):
idx.insert(i, (x - r, y - r, x + r, y + r))
# 圆形聚类
clusters = []
defect_clusters = []
# 类似dbscan,访问过的不再计算
visited = set()
for i in range(num_points):
if i not in visited:
cluster = []
defect_cluster = []
stack = [i]
while stack:
point = stack.pop()
visited.add(point)
cluster.append(point)
x, y, r = points[point]
# 在这里我算了两个eps 45和25
results = idx.intersection((x - r - 45, y - r - 45, x + r + 45, y + r + 45))
for j in results:
# 计算考虑半径的距离
distance = distance_with_radius(points[point], points[j])
if j not in visited and distance <= 45:
if distance <= 25:
defect_cluster.append(point)
stack.append(j)
clusters.append(cluster)
# 输出聚类结果
num_clusters = len(clusters)
print("聚类数量:", num_clusters)
labels_cluster = np.full(num_points, -1)
labels_defect = np.full(num_points, -1)
n = 1
for cluster in clusters:
if len(cluster) > 1:
labels_cluster[[idx for idx in cluster]] = n
n += 1
n = 1
for cluster in defect_clusters:
if len(cluster) > 1:
labels_defect[[idx for idx in cluster]] = n
n += 1
print(labels_cluster)
print(labels_defect)
相较于dbscan加自定义距离公式快了不止一倍
同理,kd-tree也可用同样的方式