Data structure is a widely used term in the whole computer science and technology field. It is used to reflect the internal composition of a data, that is, a data is composed of those constituent data, in what way and in what structure. Data structures can be divided into logical data structures and physical data structures. The logical data structure reflects the logical relationship between component data, while the physical data structure reflects the storage arrangement of component data in the computer. Data structure is the existing form of data. Data structure is a way to organize information, and its purpose is to improve the efficiency of the algorithm. It usually corresponds to a set of algorithms, through which some operations can be performed on the data in the data structure.
What does the data structure mainly study?
As a discipline, data structure mainly studies various logical structures and storage structures of data, as well as various operations on data. So there are three main aspects: the logical structure of data; Physical storage structure of data; An operation (or algorithm) on data. Usually, the algorithm's
The design depends on the logical structure of the data, and the implementation of the algorithm depends on the physical storage structure of the data.
What is a data structure? What are logical structure and physical structure?
Data refers to a set of elements consisting of a finite number of symbols (such as "0" and "1", which have their own structures, operations and corresponding semantics). A structure is a collection of relationships between elements. Generally speaking, the data structure DS can be represented as a binary group:
Ds = (d, s),//that is, data-structure = (data-part, logic-structure-part), where D is a set of data elements (or "nodes", or possibly "data items" or "data fields") and S is a set of relationships defined on D (or other sets). There are four basic types of logical structure: set structure, linear structure, tree structure and network structure. Table and tree are two most commonly used efficient data structures, and many efficient algorithms can be designed and implemented by using these two data structures. Tables are linear structures (fully ordered relationships), while trees (partially ordered or hierarchical relationships) and graphs (weakly/locally ordered) are nonlinear structures.
The physical structure of data structure refers to the storage image of logical structure. The physical structure P of the data structure ds corresponds to the mapping from the data elements of DS to the storage area M (maintenance logical structure S):
(PD,S)-& gt; M memory model: A memory M is a series of memory cells with fixed size, and each cell U has a unique address A(U), which is coded continuously. Every unit u has a unique successor unit U'= succ(U). Four basic mapping models of P: sequential mapping, link mapping, index mapping and hash mapping.
Therefore, we can get at least 4×4 possible physical data structures:
Order (group)
Link list
Index tree
Hash graph
(Not all possible combinations are reasonable)
Operations on data structure DS: When changing data elements (nodes) or domains of nodes, all operations defined on DS must maintain the logical and physical structure of DS.
Basic operations on DS: Any other advanced operations on DS can be realized through these basic operations. It is best to regard DS and all its basic operations as a whole-called a module. We can further abstract this module into a data type (in which the storage structure of DS is represented as a private member and the basic operation is represented as a public method), which is called ADT. Just like ADT, stack and queue are special tables with subsets of table operations. The advanced operation of DATs can be designed as an (uncompressed) algorithm, which uses basic operations to deal with DS.
Good and bad DS: If a DS can be converted into a linear DS through some "linear rule" (such as linear table), it is called a good DS. A good DS usually corresponds to a good (efficient) algorithm. This is determined by the computing power of the computer, because the computer can only access logically continuous memory cells in essence, so how to have a structure without linearization is logically uncountable. For example, to operate a graph, to access all nodes of the graph, all nodes must be accessed in a certain order (forming a partial order), and the inherent nonlinear structure of the graph must be transformed into a linear structure in some way before the graph can be operated.
Tree is a good DS-it has very simple and effective linearization rules, so many very effective algorithms can be designed with tree. The implementation and use of the tree are very simple, but it can solve many particularly complicated problems, so the tree is the most important and useful data structure in actual programming. The structure of the tree is recursive in nature-each leaf node can be replaced by a subtree, and vice versa. In fact, every recursive structure can be transformed into (or equivalent to) a tree structure.
Abstraction from machine language to high-level language
As we all know, an algorithm is defined as a series of operations. All operations in this operation sequence are defined on a specific data model to solve specific problems. This operation sequence should have the following four characteristics. Finite, that is, the number of items in the sequence is limited, and each operation item can be completed in a limited time; Certainty, that is, every operation of the sequence has a clear definition, and there is no ambiguity; There can be no input operator, but there must be an output operator; Feasibility, that is, any given legal input can get corresponding correct output. These characteristics can be used to judge whether an operation sequence can be called an algorithm. However, our problem now is not to judge whether an operation sequence can be called an algorithm, but to review how we use programming language to express an operation sequence that has been called an algorithm.
In the final analysis, the program expression of the algorithm is the program expression of the algorithm elements, because once every element of the algorithm is clearly expressed by the program, the program expression of the whole algorithm will not be a problem.
As an algorithm of operation sequence, there are three elements. The data as the operation object and the operation results of various operations in the operation sequence; Various operations in the operation sequence; Control transfer in operation sequence. These three elements are called data, operation and control in turn. Because of the endless variety of algorithms, there are countless object data and result data. The simplest and most basic are Boolean data, character data, integer and real data. Slightly complicated data such as vectors, matrices and records; More complicated are sets, trees and graphics, as well as data such as sound, graphics and images. Similarly, due to the endless variety of algorithms, the types of operations are also varied and colorful. The most basic and initial operations are assignment operation, arithmetic operation, logical operation and relational operation. Slightly more complicated are arithmetic expressions and logical expressions; More complicated are function value calculation, vector operation, matrix operation, set operation, and operations on tables, stacks, queues, trees, graphs, etc. In addition, the operations listed above may be compound and nested. About the transfer of control, it is relatively simple. In serial calculation, there are only several kinds, such as sequence, branch, cycle, recursion and unconditional transfer.
Let's review how the program expression of the above three elements of the algorithm has gone through since the emergence of computers.
The earliest programming language was machine language, that is, the instruction set on a specific computer. At that time, all the algorithms to be run on the computer must be expressed directly in machine language before the computer can accept them. The operation sequence of the algorithm, including the operation object and the operation result, must be converted into the instruction sequence. Each instruction appears in the form of code (instruction code and address code). Compared with the algorithm expressed in algorithmic language, the difference is 108 thousand Li. For people who have no special training in programming, a program is like a "gobbledygook", which makes people confused and unreadable.
Very poor.
It is very complicated to express the operation, data and control of the algorithm in machine language, because the instructions provided by machine language are too primitive. Machine language only accepts arithmetic operation, bitwise logic operation and comparison operation of numbers. For slightly complicated operations, they must be decomposed one by one until the initial operation is reached, and the corresponding instructions can be used instead. Only the most primitive bits, bytes and words can be directly expressed in machine language. Even the simplest data in the algorithm, such as Boolean values, characters, integers and real numbers, must be mapped to bits, bytes and words one by one.
You have to allocate their storage units one by one. It is much more troublesome to express structured data in algorithms. The control transfer instructions provided by machine language are only the most basic instructions such as unconditional transfer, conditional transfer, entering subroutine and returning from subroutine. Using them to construct loops, form branches, call functions and procedures requires a lot of preparation in advance and depends on a lot of skills. There are many shortcomings in expressing the algorithm directly in machine language.
A large number of trivial details bind programmers, making them unable to have more time and energy to engage in creative work and perform tasks that are more important to them. Such as ensuring the correctness and efficiency of the program. Programmers should not only master the overall situation of program design, but also go deep into every part until the details are realized. Even programmers with superior intelligence often ignore one thing and make mistakes repeatedly, so the compiled program has poor reliability and long development cycle. Because the thinking and expression of programming in machine language are very different from people's habits, only in
Only programmers who have been trained for a long time can be competent, which makes programming music high and low. Because its written form is all "secret" codes, it is not readable and is not convenient for communication and cooperation. Because it depends heavily on a specific computer, its portability and reusability are poor. These shortcomings led to the lack of rapid popularization of computer applications at that time.
The way to overcome the above shortcomings is to abstract the programming language and make it as close as possible to the algorithm language. For this reason, people first pay attention to readability and portability, because they are relatively easy to improve through abstraction. As a result, the assembly language soon appeared. This language's abstraction of machine language is first manifested in symbolizing every instruction of machine language: replacing instruction code with a memory symbol and replacing address code with a symbolic address, so that its meaning appears on the symbol rather than hidden in the code, which can make people expect the meaning of the text. Secondly, this language can run on computers with different instruction sets, as long as the computer is equipped with an assembler of assembly language. This is undoubtedly a step closer to the algorithmic language. But it is too far away from the algorithm language, so that programmers can't get rid of the complicated and trivial things such as data decomposition, operation control and assembly of instructions that can be directly expressed. In the mid-1950s, Fortran, Algol60 and later PL/l, Pascal and other advanced programming languages appeared, and the program expression of the algorithm made a great leap.
Admittedly, the algorithm will eventually be expressed in the machine language of a specific computer before it can run on the computer and get the required results. However, the practice of assembly language enlightens people that it is not necessary to achieve machine language in one step, but it can be done in two steps or build a bridge across the river. That is, it is first expressed in an intermediate language and then transformed into a machine language. As an intermediate language, assembly language has not achieved great success because it is far from algorithmic language.
It is still too far. This leads people to design a standard language that is as close as possible to the algorithm language, that is, the so-called high-level language, so that programmers can use it to express the algorithm conveniently, and then use the standard high-level language to "translate" the standard machine language, and finally express the algorithm into the machine language. Moreover, because high-level language and machine language are normative, the "translation" here can be done mechanically by the computer, just like translating assembly language into machine language, as long as the computer is equipped with a compiler. The first step is done by the programmer, and the second step can be done by the compiler. After specifying what they should do, these two steps are completely independent. How they should do it doesn't matter to each other. The first step is to correctly express a given algorithm in a high-level language and generate a high-level language program; The next step is to translate the high-level language program obtained in the first step into a machine language program. As for how programmers express algorithms in high-level languages and how compilers translate algorithms expressed in high-level languages into algorithms expressed in machine languages, it obviously doesn't matter.
The above thinking method of dealing with the complex process from algorithmic language to machine language is an abstraction. The emergence of assembly language and high-level language are both examples of this abstraction. Compared with assembly language, the great success of high-level language lies in its data, operation and control.
Many concepts and tools close to algorithm language are introduced into the expression of surface, which greatly improves the ability of abstract expression algorithm. In terms of operation, Pascal and other advanced languages not only allow the four operations, logical operations, relational operations, arithmetic expressions and logical expressions of the algorithm language to be used intact, but also introduce powerful functions and process tools, which users can customize themselves. The importance of this tool is not only that it simplifies the repeated program text segments, but also that it embodies the two-level abstraction of the program.
At the level of function and procedure call, people only care about what it can do, not how it is done. Only when it comes to the definition of functions and procedures do people give the details of how to do it. Readers who have used high-level languages know that once the names, parameters and functions of functions and procedures are clearly defined, calling them in the program and explaining them at the program header are completely separated. You can modify or even replace function bodies and procedure bodies without affecting their calls. If function and procedure names are regarded as operation names and parameters are regarded as operation objects or operation results, then
The call of functions and procedures is no different from the reference of elementary operations. Any complex operation in algorithmic language can be naturally expressed by using functions and procedures and their compounding or nesting.
In terms of data, Pascal and other high-level languages have introduced the concept of data type, that is, classifying all data. Every data (including expressions) or every data variable belongs to a certain category. This kind of data is called data type. Therefore, data type is a description of data or data variable category, which represents the whole set of values that data or data variable may take. For unstructured data, Pascal and other advanced languages not only provide standard basic data types-Boolean, character, integer and real number, but also provide user-defined enumeration classes, subboundaries and pointer types. These types (except pointers) are used in algorithmic languages according to people's habits. For structured data, Pascal and other high-level languages provide four standard structured data types: array, record, restricted set and file. Among them, array is the abstraction of vector and matrix in scientific calculation; Records are abstractions of business and management records; A finite set is an abstraction of the potential set of a sufficiently small set in mathematics. A file is an abstraction of data stored externally (such as a disk).
People can use the provided basic data types (including standard and user-defined) to construct structured data according to the construction rules of arrays, records, restricted sets and files. In addition, it also allows users to use standard structural data types to construct more complex and higher-level structural data through combination or nesting. This makes the data types in high-level languages have obvious hierarchy. The hierarchical structure of data types in high-level languages is endless, so they can be used to express data of any complex level in algorithmic languages. In terms of control, Pascal and other high-level languages provide six ways to express algorithm control transfer.
(1) Default sequence control ";" .
(2) Conditional (branch) control: "If the expression is true, then S 1 otherwise S2;" .
(3) Selection (situational) control:
"Case statement
Value 1: S 1
Value 2: S2
...
Value n: Sn
End "
(4) Cycle control:
“while expression(true)do S;” or
"Repeat until the expression (true);" or
"For variable name: = initial value to/down to final value do S;;"
(5) Call of functions and procedures, including call of recursive functions and recursive procedures.
(6) unconditionally transfer goto.
These six expressions not only cover all the requirements of control expressions in algorithmic language, but also are not as primitive, complicated and obscure as machine language or assembly language, but as seen above, they are almost the same as those in natural language. The abstraction of programming language from machine language to high-level language brings the following main benefits: high-level language is close to algorithmic language, easy to learn and master, and ordinary engineers and technicians can be competent as programmers only after a few weeks of training; High-level languages provide programmers with an environment and tools for structured programming, which makes the designed programs readable, maintainable and reliable. High-level language is far away from machine language and has little to do with specific computer hardware, so the written program has good portability and high reuse rate; Because the complicated and trivial affairs are handed over to the compiler, the degree of automation is high and the development cycle is short. Freeing programmers and programmers, we can concentrate our time and energy on creative work that is more important to them and improve the quality of programs.
Data structures, data types and abstract data types
Data structure, data type and abstract data type are both different and similar literally, which reflects their differences and connections in meaning.
Data structure is a widely used term in the whole computer science and technology field. Used to reflect the internal composition of a data, that is, which component data a data consists of, how to form it, and what structure it presents. Data structures can be divided into logical data structures and physical data structures. The logical data structure reflects the logical relationship between component data, and the physical data structure reflects the storage arrangement of component data in the computer. Data structure is the existing form of data.
Data are classified according to data structure, and data with the same data structure belong to the same category. The sum of similar data is called a data type. In high-level programming languages, data types are used to explain the attributes of data in data classification. It is an attribute of data. This property limits the range of data changes. In order to solve the problem, high-level languages define a series of data types according to the types of data structures. Different high-level languages define different data types. The type of data type defined by Pascal language.
Wherein, a simple data type corresponds to a simple data structure; The construct data type corresponds to a complex data structure; In complex data structures, component data itself allows complex data structures, so the constructed data types allow nesting; Pointer type corresponds to the relationship between component data in data structure, which is a simple data type on the surface, but actually points to complex component data, that is, data in structured data type, so it is not classified as simple data type or structured data type here, but as a separate category.
Data structure reflects the internal structure of data, which is often described by structure diagram: each component data in the data is regarded as a node, represented by a box or circle, and the relationship between component data is represented by the connection line with arrows between corresponding nodes. If the component data itself has its own structure, the structure is nested. Nesting here also allows recursive nesting.
Due to the introduction of pointer data, it is possible to construct various complex data structures. According to the relationship between the constituent data in the data structure, the data structure can be divided into linear and nonlinear. In a nonlinear data structure, there are layers and grids. Because data types are divided according to data structures, a data structure corresponds to a data type. According to the structure of data in this type, data types are also divided into linear and nonlinear, hierarchical and mesh. The type description of data variables in high-level languages must be the data type corresponding to the data structure of read variables. The most commonly used data structures are array structure and record structure. The array structure is characterized in that:
The number of component data is fixed, and the logical relationship between them is reflected by the serial number of the component data (or the subscript of the array). These component data are arranged one after another in the order of serial numbers. The data of each component has the same structure (simple structure or complex structure), so it belongs to the same data type (simple data type or structural data type respectively). This same data type is called the base type. All component data are sequentially arranged in continuous storage units. To sum up, the array structure is a linear and unified structure, and its component data can be accessed randomly.
Because of these good characteristics, this structure is most often adopted by people. In high-level languages, the data type corresponding to array structure is array type, that is, the data variables of array structure must be described as array [i] of T0, where i is the subscript type of array and structure, and T0 is the base type of array structure. Record structure is another commonly used data structure. Its characteristic is that the number of component data is fixed, just like the array structure. However, there is no natural order between component data, and they are in an equal position. Each component data is called a domain and given a domain name. Different domains have different domain names. Different domains allow different structures, so they are allowed to belong to different data types. Like the array structure, it can be accessed randomly, but the access method depends on the domain name. In high-level languages, the data type corresponding to the record structure is the record type. The data variable of the record structure must be specified as the record type.
The meaning of abstract data types has been described in the previous paragraph. It can be understood as a further abstraction of data types. That is, data types and operations on data types are bundled and packaged. The purpose of introducing abstract data types is to separate the representation of data types and the implementation of operations on data types from the references to these data types and operations in the program, so that they are independent of each other. For the description of abstract data types, we should not only describe their data structures, but also describe the operations (procedures or functions) defined on them. Procedures and functions defined on abstract data types
This number is based on the data structure that data of abstract data type should have.
Universal design, data structure and algorithm
Next, I want to talk about the latest improvement of data structure and algorithm by generic programming model. The general idea has been combined with data.
The basic idea of construction and algorithm has been abstracted to an unprecedented height. There are many programming languages that support universal design, such as
ADA, C++, and it is said that the next version of JAVA and C# will also fully support generic design.
Let's talk about the basic idea of generic design: Generic programming (hereinafter referred to as GP) is a brand-new programming idea. Different from the well-known programming ideas such as OO, OB and PO, GP is more abstract, and there is no inheritance relationship between components based on GP design, so its interoperability and expansibility are very high. As we all know, any algorithm works on a specific data structure. The simplest example is quick sorting. The most fundamental realization condition is that the sorted object is storage.
It is stored in an array, because the quick sorting is due to the random storage characteristics of the array, that is, distant objects can be exchanged in a unit time, not just two adjacent objects. However, if the linked list is used to store objects, the quick sorting will lose its quick characteristics because the time to obtain the objects in the linked list is linear. That is to say, when we design an algorithm, we always consider the data structure of its application first, such as array lookup, linked list lookup, tree lookup and graph lookup, the core of which is lookup, but because of the different data structures.
There will be many different forms of expression. This close relationship between data structure and algorithm has always been our previous understanding. The basic idea of generic design is to separate the algorithm from the data structure it acts on, that is to say, when we design the algorithm, we don't consider what kind of data structure the algorithm we design will act on. The ideal state of generic design is that a search algorithm will be able to act on various data structures such as arrays, linked lists, trees and graphs, and become a general and universal algorithm. Is this ideal attractive?
Generic programming has brought unprecedented flexibility and abstraction, while not losing efficiency. Unlike OO, GP doesn't need you to call functions through an extra indirect layer: it allows you to write completely universal and reusable algorithms, which are as efficient as those designed for specific data structures. We all know that the data structure in C++ can be expressed by user-defined types, while the template technology in C++ takes types as parameters, so I can imagine that we can realize our GP idea starting from template technology, that is, a template function can act on incoming types, which can be all kinds of data structures defined by us.
Generic algorithms are separated from specific types and specific data structures, which makes them adapt to general types as much as possible. The algorithm itself is only to realize the logical essence that the algorithm needs to express, and is not disturbed by the implementation details of various data structures. This means that a general algorithm actually has two parts. 1, an actual instruction used to describe the essential logic of the algorithm; 2. Correctly specify a set of essential requirements that its parameter type must meet. At this point, I believe many people have begun to be confused, hehe, it doesn't matter. After all, GP is a highly abstract programming idea, and its core is that abstract conditions become the core in the programming process, thus replacing the core position of types in OO. It is precisely because types are no longer the focus of our consideration and become the cloak of abstract conditions that we call such a programming idea generic thinking-generalized types.