Current location - Training Enrollment Network - Books and materials - Easy-to-understand distributed Kademlia algorithm
Easy-to-understand distributed Kademlia algorithm
In recent years, the popularity of blockchain technology (some people prefer to call it distributed ledger technology) has brought the concept of distributed technology into public view. Blockchain technology is highly sought after, on the one hand, because it shows a bright future, that is, with the help of computers, human beings can carry out social cooperation in a way without center, authority and hierarchy; On the other hand, it can be physically proved that distributed simple protocols are more efficient than centralized complex protocols. Distributed technology seems to bring fairness and efficiency.

It is not difficult to understand distributed technology, because it is not profound, but its design is often ingenious.

This paper introduces a common and ingenious distributed technology-Kademlia algorithm.

Kademlia algorithm is a distributed storage and routing algorithm. What is distributed storage? Imagine a school with 65,438+0,000 students. Now the school suddenly decided to tear down the library (no centralized server) and distribute all the books in the library to every student (all files are stored in each node). That is, all students * * * together form a distributed library.

In this case, there are several key questions to be answered.

Next, let's look at how Kademlia algorithm skillfully solves these problems.

First of all, let's take a look at what attributes each classmate (node) has:

Every student will maintain the following contents:

According to the above analogy, you can look at this table:

(For the explanation of the concept of hash, please refer to Baidu Encyclopedia-Hash Algorithm)

Why don't every student have a complete address book (every node maintains complete routing information): First, nodes in the distributed system come in and out quite frequently, and the address book of the whole network will be updated every time there is a change, and the traffic will be very large; Secondly, once a classmate is kidnapped by a bad guy (the node is hacked), the bad guy immediately has everyone's mobile phone number, which is not safe.

How are the books collected in the library distributed to students after being indexed neatly? The general principles include: 1) books can be distributed in students' hands in a relatively balanced way, and there will be no situation that some students have too many books and most students don't even have a book; 2) When students want to find a specific book, they can use a relatively simple indexing method to find it.

Kademlia made the following arrangements:

Assuming that the hash value of the title of Distributed Algorithm is 000 10000, this book will be required to be kept in the hands of students with the student number of 000 10000. (This requires that the range of the hash algorithm is consistent with the range of the node ID. The node ID of Kademlia is 160-bit binary. The example here is abbreviated as node ID)

But we must take into account that some students will be absent. In case 000 10000 doesn't come to school today (the node is not online or completely off the network), won't anyone get the book Distributed Algorithms? The algorithm requires that this book should exist not only in the hands of one student, but also in the hands of K students whose student numbers are closest to 000 10000, that is, 000 1000 1 0065438+.

Similarly, when you need to find the book distributed algorithm, hash the title and get 000 10000. This is the call number, and you will know which student to look for. The remaining problem is to find the mobile phone numbers of these students.

Since you only have some classmates' address books, you probably don't have the mobile phone number (IP address) of 000 10000. So how do you reach the target students?

A feasible idea is to find a classmate with the contact information of the target classmate in your address book. As mentioned earlier, each student's address book is layered by distance. The design of the algorithm is that the closer a classmate is to you, the greater the probability that ta's mobile phone number exists in your address book. The core idea of the algorithm can be: when you know the distance between the target classmate Z and you, you can first find a classmate B who you think is closest to classmate Z in your address book, and let classmate B go forward and find the mobile phone number of classmate Z.

The distance mentioned above is the XOR distance (node ID) between student numbers. XOR is used for yes/no or binary operations.

Give two examples:

The distance between 0 10 10000 and 01010 (that is, the XOR value of the two ids) is 000000 10 (which is 2 in decimal);

The distance between 0 10000000 and 000000 1 is 01(converted into decimal, it is 2 6+1,which is 65);

And so on.

How is the address book layered by distance? The following example will tell you that layering by XOR distance can basically be understood as layering by numbers. Imagine the following scenario:

Based on 0000 1 10, if the ID of a node is the same as it, except for the last 1, such a node is only1-00011,and exclusive OR with the base node. For 0000 1 10, such nodes are classified as "k barrels1";

If the ID of a node is the same as all the previous numbers, but different from the penultimate number, there are only two such nodes: 000010/0,000100, and the XOR value with the base node is 0000 1 1 and 0000. For 0000 1 10, such nodes are classified as "k bucket 2";

……

If the ID of a node is the same as all the previous numbers, but different from the last n, there are only two such nodes (I- 1), and the distance from the base node is [2 (i- 1), 2i); For 0000 1 10, such nodes are classified as "k bucket I";

Another way to understand the above description: if the nodes of the whole network are combed into a binary tree arranged by node ID, each leaf at the end of the tree is a node, and the following figure shows the distance relationship between nodes more intuitively.

Back to our analogy. Each student only maintains a part of the address book, which is layered according to the distance (it can be understood that the address book is layered according to the student number and their own student number), that is, k-bucket 1, k-bucket2, K-Bucket3 ... Although the actual number of students in each k bucket is increasing gradually, each student only records the mobile phone numbers of k students in each k bucket.

Because the student number (ID of the node) has 160 digits, each student's address book is divided into 160 layers (the node * * has 160 k buckets). The whole network can accommodate at most two students (nodes) of 160, but each student (node) only maintains the address book of 160 * k at most (addresses and ports of other nodes).

Now let's explain a complete book request process.

A classmate (student number 0000 1 10) wants to find a distributed algorithm. A needs to calculate the hash value of the title first, hash (distributed algorithm) = 000 10000. Then a knows that ta needs to find a classmate 000 10000 (named classmate z) or a classmate with a student number close to Z.

The XOR distance between Z's student number 000 10000 and himself is 0001010, and the distance range is [2 4, 2 5], so this Z student may be in K bucket 5 (in other words, Z's student number and a fifth student)

Then student A will check whether there is student Z 5 in his K bucket:

Kademlia's query mechanism is a bit like constantly folding a piece of paper in half to narrow the search scope, thus ensuring that for any n students, you only need to query the log for 2 (n) times at most to find the contact information of the target student (that is, for any one with [2 (n? 1), 2 n) A network with one node can find the target node in only n steps at most.

The above is the basic principle of Kademlia algorithm. The technical details of the agreement are briefly introduced as follows.

In Kademlia algorithm, each node has only four instructions.

This mechanism ensures that the joining and leaving of any node will not affect the whole network.

Kademlia is a distributed hash table (DHT). DHT is a decentralized distributed system. In this system, each node maintains a part of the stored content and the routes/addresses of other nodes, so that when any participant (node) in the network changes (enters/exits), the impact on the whole network is minimal. DHT can be used to build more complex applications, including distributed file system, peer-to-peer file sharing system, collaborative web caching, domain name system and real-time communication.

Kademlia algorithm was designed by Petar Maymounkov and David Mazières in 2002, which is characterized by layering hash tables with XOR distance. Kademlia was later adopted as the underlying algorithm by peer-to-peer software such as eMule and BitTorrent. Kademlia can be used as one of the foundations of information security technology.

The advantages of Kademlia are:

refer to

Wikipedia distributed hash table

Wikipedia -Kademlia

Kademlia: A Peer-to-Peer Information System Based on XOR Measure

Notes on Prince Kadmlia Pavilion

Han Feng. Artificial intelligence of blockchain. Post-translation Notes of New Economic Blueprint and Guide of Blockchain by Xinxing Publishing House.