Starting off, for transparency reasons I want to state that sklearn has a module for the DBSCAN algorithm. But I like to make things with minimum dependencies just because it's more educational and more fun.
One more thing I like to add, which I touched on a bit in the previous article about clustering. Is the fact that i like to use the least amount of dependencies as reasonably possible. Another thing to note is that traditionally ICT is applied to forex markets. To apply this methodology to crypto markets one important thing to note is because of the decentralized nature of crypto currencies, to really get a grasp of the market dynamic it is important to use aggregated data. Luckily the yahoo finance platforms offers this. So I use the yfinance python module. Just that i convert the pandas dataframe into a list of dictionaries to make the data more native to python. You really don't have to do this and I only do it as a matter of preference. So later code will be based on this fact. What I do really isn't that fancy, first to keep the timestamps I use reset_index(inplace=True) and then I use to_dict(orient='records') on the dataframe. This is more than enough to handle my needs, I just wanted to get that put of the way before continuing.
With that being said let's get into the good stuff. First off, DBSCAN stands for Density Based Spatial Clustering of Application woth Noise. It sounds pretty fancy but isn't as scary as it seems. We will get into this algorithm and how it works here and will turn put to be a pretty familiar method hopefully. Especially if you read my previos post on the concept of clustering.
This algorithm is very useful for identifying liquidity zones for various reasons. Starting by the fact that, as it's name suggests is ideal for noisy data. Sometimes with other algorithms sharp price movements like spikes can throw of your end results. Since the crypto markets and even forex markets for that matter have high volatility, this is a common problem that you can often run into. It groups together data points, in this case prices, that are around certain key levels of interest and marks the rest as outliers. So this algorithm as its name suggests, identifies data points where there is high density, in this case a significant amount pf market activity, and separates them from sparse data points.
The parameters this function takes are quite simple. First is epsilon (which is a greek letter but since my keyboard doesn't have that character I will represent it with "eps") which defines the maximum distance that can be between two points for them to be considered close together or "neighboring". The other parameter is the minimum amount of samples or data points in a certain region for it to be considered dense. That's it. For designing pir code I will use a pseudocode that I found on research gate which is linked bellow. Other concepts of interests are core points, border points and outliers which I will explain briefly. Core points are points with the minimum amount of data points within eps distance from them, which are considered members of a cluster. Border points are points that have core points within eps distance of them but do not have enough points near them to be considered core points. And basically everything else is considered noise.
Here in short we will talk about how this algorithm works. We start by separating the data points on two broad categories which represent points that have been visited by the algorithm and points tha haven't. We then pick an arbitrary point that has noy been visited. We then look around that point for all the other points that are within the eps distance. If there are enough points, this region is considered a cluster. If a given point is a core point we will expand the region to see if we can find more neighboring points that meet are criteria, this is called expansion. This is repeated until we can't expand more. After this is done the entire process is repeated until all points have been visited. Points that weren't reached by any core point are considered noise.
I will try to keep dependencies a s a minimal and stick to the standard library. One thing I'll use is deque (double sided queue from collections) to help use out with the expansion of clusters.
So now let's look in to our actual DBSCAN algorithm, be sure to import deque from collections.
def dbscan(prices, eps=0.2, min_samples):
#initialization
clusters = []
#we will use sets for visited and noise to make sure they are unique
visited = set()
noise = set()
#for readability we will define a function
#that gets the neighbors of a point
def get_neighbors(idx):
neighbors = []
for i in range(len(prices)):
if(abs(prices[idx] - prices[i]) <= eps):
neighbors.append(i)
return neighbors
#cycle through the prices
for i in range(len(prices)):
#skip visited points
if(i in visited):
continue
visited.add(i)
neighbors = get_neighbors(i)
#check if neighbors meet our minimum threshold
if(neighbors < min_samples):
noise.add(i)
else:
cluster = []
cluster.append(i)
#use the neighboring points to try to expand the cluster
queue = deque(neighbors)
while queue:
neighbor_idx = queue.popleft()
if(neighbor_idx not in visited):
visited.add(neighbor_idx)
next_neighbors(neighbor_idx)
#if the newly found neighbors exceed the threshold
#extend our queue with those points
if(len(next_neighbors) >= min_samples):
queue.extend(next_neighbors)
#dont add a point that is already in a cluster
if(not any(neighbor_idx in c for c in clusters)):
cluster.append(neighbor_idx)
clusters.append(cluster)
return clusters, list(noise)
So now we will feed it our data
and watch the results. Note that i will be using the closing prices from my data as there is a consensus on the fact that closing prices are significant in terms of the stabilizing of prices at the end of a period. For now I have set the eps variable to a hard coded default value. The observing eye would have noticed that this could be a problem since we are using the absolute distance to calculate the neighbors of a point. So it doesn't take into account that different assets have different magnitudes in price. It is not the same scale when we look at say the price of ethereum or the price of say DOGE. In the next post we will look into how we can make our eps parameter more dynamic.
This post is already pretty long so I will leave it here. You can follow me on Twitter to get updates on my future posts and also some signals that are generated by other strategies.