Serial and binary search methods in advanced data structures
Furthermore, the records are often organized using data structures that provide a sequential record order, based on values contained within the records. For example, a telephone company might maintain a file or data set that contains the name of each customer in a particular service area, and that file might be maintained in alphabetical order, according to customer name. Primitive database systems provide ordering for such files by simply keeping records in sequential physical memory locations such as on tape or in a disk drive according to the desired order.
Such an organization is essentially an array, and it is inefficient, and therefore rarely used, when processing large files because records usually cannot be added and deleted without altering a large percentage of the utilized storage.
Other database systems employ more sophisticated data structures that use pointers to order records. Linked lists and binary search trees (BSTs) are two such data structures, and both allow records to be added and deleted more efficiently than arrays do.
However, when data sets are searched for particular values (a step that is prerequisite to adding and deleting records in any ordered data set), binary search trees are much more efficient than linked lists. A BST is considered balanced when, for each node, the number of nodes in the left subtree of that node and the number of nodes in the right subtree of that node differ by at most one.
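The balance criterion stated above can be checked mechanically. The following is a minimal sketch, assuming an in-memory node with a value and two child references; the names Node, count_and_check, and is_balanced, and the sample customer names, are illustrative and do not come from the described system.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def count_and_check(node: Optional[Node]) -> Tuple[int, bool]:
    """Return (number of nodes, balanced?) for the subtree rooted at node."""
    if node is None:
        return 0, True
    left_count, left_ok = count_and_check(node.left)
    right_count, right_ok = count_and_check(node.right)
    # Balanced here means the two subtree node counts differ by at most
    # one, and the same condition holds for every node further down.
    balanced = left_ok and right_ok and abs(left_count - right_count) <= 1
    return left_count + right_count + 1, balanced


def is_balanced(root: Optional[Node]) -> bool:
    return count_and_check(root)[1]


# Hypothetical sample data: one balanced tree and one degenerate chain.
balanced_tree = Node("Lee", Node("Adams"), Node("Young"))
chain = Node("Adams", right=Node("Lee", right=Node("Young")))
```

Note that this criterion compares node counts rather than subtree heights, so it is stricter than the height-based balance conditions used by some self-balancing trees.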
Numerous methods for balancing binary search trees have been developed; however, those methods present a number of disadvantages. As suggested by New Algorithms, an advantage provided by such a technique is its adaptability for parallel processing.
Among the disadvantages associated with the New Algorithms technique, however, are the number and complexity of instructions used to perform the balancing. Furthermore, as the number of records to be balanced increases, so does the time associated with executing a lengthy and complex tree balancing utility.
In fact, tests have shown that, when balancing a BST containing millions of records, the difference between inefficient and efficient balancing facilities can mean the difference between spending over an hour balancing the tree and spending less than a minute. The present invention recognizes that the need for an efficient technique for balancing BSTs becomes more pressing as databases containing millions of records become more commonplace.
To address the shortcomings of conventional tree balancing facilities, the present invention provides a system, method, and program product that balances BSTs by copying pointers to the nodes of a BST into a pointer list in accordance with a sequential order of respective data values of the nodes. The balancing facility then builds a balanced BST based on a first index to a first pointer of the pointer list and a second index to a last pointer of the pointer list.
In an illustrative embodiment, the balanced BST is built by identifying a central pointer at a midpoint of the pointer list, a left range of pointers before the midpoint, and a right range of pointers after the midpoint.
The central pointer is then interpreted as a pointer to a root node, and a balanced left subtree of the root node and a balanced right subtree of the root node are built based on the left range of pointers and the right range of pointers, respectively. All objects, features, and advantages of the present invention will become apparent in the following detailed written description.
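The midpoint construction summarized above can be sketched in a few lines. This is an illustrative sketch only, assuming the pointer list is a Python list of node objects already in data-value order; the names Node, build_balanced, and height, and the sample names, are hypothetical, and rounding the midpoint down is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_balanced(pointers: List[Node], first: int, last: int) -> Optional[Node]:
    """Relink the nodes in pointers[first..last] into a balanced subtree."""
    if first > last:
        # Empty range: there is no subtree to build.
        return None
    mid = (first + last) // 2          # central pointer (rounding down is assumed)
    root = pointers[mid]               # interpreted as the root of this range
    root.left = build_balanced(pointers, first, mid - 1)   # left range
    root.right = build_balanced(pointers, mid + 1, last)   # right range
    return root


def height(node: Optional[Node]) -> int:
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


# Hypothetical pointer list, already ordered by data value.
names = ["Adams", "Baker", "Chen", "Diaz", "Evans", "Ford", "Gray"]
pointer_list = [Node(n) for n in names]
root = build_balanced(pointer_list, 0, len(pointer_list) - 1)
```

Because only the left and right references are rewritten, the nodes themselves stay where they are; the list is needed only while the tree is being rebuilt.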
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives, and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings. With reference now to the figures, and in particular with reference to FIG.
PCI local bus is connected to one or more additional non-volatile data storage devices, such as a disk drive, and to an audio adapter and a graphics adapter for controlling audio output through a speaker and visual output through a display device, respectively. An expansion bus bridge, such as a PCI-to-ISA bus bridge, connects PCI local bus to an ISA bus, which is attached through appropriate adapters to a keyboard and a mouse for receiving operator input.
DPS may also include data ports for communicating with external equipment, such as other data processing systems. Those data ports may include, without limitation, a serial port attached to ISA bus for linking DPS to remote data processing systems via a modem (not illustrated) and a network adapter attached to PCI bus for linking DPS into a local area network (not illustrated). Stored on disk drive are at least one set of data, such as a Customer-Name BST, and a database system. Database system is loaded into RAM and executed on CPU to provide an interface that allows records to be read from, added to, and deleted from Customer-Name BST by conventional application programs, such as a Customer Account Maintenance application (not illustrated).
When Customer-Name BST is balanced, searches of Customer-Name BST for particular records are efficient, in that such searches have a time complexity of log2(n), where n equals the number of nodes in the tree.
As records are added to and deleted from the tree, however, the tree will likely become unbalanced, and the time complexity for searching the tree for a particular value will consequently increase towards n. Therefore, in order to maintain optimal search times, it is necessary to rebalance Customer-Name BST, either periodically (such as every night) or in response to a determination that the tree has exceeded an acceptable level of unbalance (such as by monitoring search times or examining the topography of the tree).
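The drift from roughly log2(n) toward n can be seen on a small example. The sketch below is illustrative only: it uses a plain, non-rebalancing insert and integer keys, and the names insert, build, and comparisons_to_find are hypothetical rather than part of the described database system.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Plain BST insert with no rebalancing; returns the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                break
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                break
            node = node.right
    return root


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


def comparisons_to_find(root, key):
    """Count the nodes visited while searching for key."""
    visited = 0
    node = root
    while node is not None:
        visited += 1
        if key == node.key:
            break
        node = node.left if key < node.key else node.right
    return visited


# Keys arriving in sorted order produce a chain: every node has only a right child.
degenerate = build(range(1, 16))
# The same 15 keys inserted level by level produce a balanced tree.
balanced = build([8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15])
```

With 15 keys, finding the largest key visits 15 nodes in the chain but only 4 in the balanced tree, which is the gap that periodic rebalancing is meant to close.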
Consequently, a tree balancing facility according to an illustrative embodiment of the present invention is stored on disk drive and, when rebalancing is desired, that program is loaded into RAM and executed on CPU to return Customer-Name BST to a balanced state. With reference now to FIG. At the highest level of the diagram are the application programs, including database system and tree balancing facility. Preferably, tree balancing facility includes two major components: an extraction facility and a tree builder. At the intermediate level is an application program interface (API), through which application programs request services from the operating system. Operating system, which occupies the lowest level of the diagram, manages the operations of data processing system by performing duties such as resource allocation, task management, and error detection.
Included in operating system is a kernel that manages the memory, files, and peripheral devices of data processing system. The lowest level also includes device drivers, such as a keyboard driver and a mouse driver, that kernel utilizes to manage input from and output to peripheral devices. Referring now to FIGS. Customer-Name BST includes a tree pointer and a set of six records or nodes arranged in four levels. In the illustrative embodiment, tree pointer is stored in disk drive at location or address 1, and, as shown in FIG.
Each node of Customer-Name BST, including root node, contains a data value, a left pointer, and a right pointer. Each left pointer and each right pointer either contains a pointer value for a left or right subordinate node, respectively, or is null.
Accordingly, the right subtree of root node starts at i. In the illustrative embodiment, the left subtree of root node has four nodes, while the right subtree has only one node; therefore, Customer-Name BST is not balanced. With reference now to FIGS. Balancing facility begins by accepting a pointer to a tree to be balanced and creating an array just large enough to hold a node pointer for each node in that tree. Then, as described in greater detail below, balancing facility calls extraction facility to build a pointer list and tree builder to produce a balanced tree based on that pointer list.
As shown in FIG. Referring now to FIG. Instruction trace, which illustrates parameter values in square brackets in place of parameter labels, begins when balancing facility calls extraction facility with tree pointer. As shown by the lines with a recursion level indicator of 1 in the left margin, extraction facility builds a pointer list by (a) using recursion i.
However, as shown by the boldface lines at recursion levels 5 L, 5 R, 4 R, 4 L, 4 R, 3 L, and 3 R, whenever the input node pointer is null, extraction facility executes a return to the caller without altering the counter or the pointer list. As shown, after extraction facility has finished, pointer list contains the addresses of the nodes of Customer-Name BST in data value sequence. It should be noted, however, that, although the data value of each node e. Preferably, the data values are not stored in the pointer list, as to do so would be inefficient.
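The extraction step can be pictured as an in-order walk that fills a preallocated list. The sketch below is illustrative and rests on several assumptions: the counter is passed and returned rather than shared, node references stand in for disk addresses, and the six-node tree and its single-letter values are hypothetical stand-ins for Customer-Name BST (four levels, four nodes in the left subtree, one in the right).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def count_nodes(node: Optional[Node]) -> int:
    return 0 if node is None else 1 + count_nodes(node.left) + count_nodes(node.right)


def extract(node: Optional[Node], pointer_list: List[Optional[Node]], counter: int) -> int:
    """In-order walk; stores each node reference and returns the updated counter."""
    if node is None:
        # Null input pointer: return without touching the counter or the list.
        return counter
    counter = extract(node.left, pointer_list, counter)        # left subtree first
    pointer_list[counter] = node                                # then the current node
    return extract(node.right, pointer_list, counter + 1)      # then the right subtree


# Hypothetical unbalanced tree: left subtree has four nodes, right subtree has one.
root = Node("M",
            left=Node("F", left=Node("C", left=Node("A")), right=Node("H")),
            right=Node("T"))

pointer_list: List[Optional[Node]] = [None] * count_nodes(root)  # just large enough
filled = extract(root, pointer_list, 0)
```

The list holds only references to the nodes, not copies of their data values, which matches the preference stated above.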
Instruction trace , which also presents parameter values in square brackets, begins when balancing facility calls tree builder with the first and last index values for pointer list i.
As shown by the lines with a recursion level indicator of 1 in the left margin, tree builder balances Customer-Name BST by (a) identifying a central pointer at a midpoint of pointer list, (b) interpreting that pointer as a pointer to a current root node, (c) using recursion to build balanced left and right subtrees for that current root node, and (d) returning the central pointer to the caller.
However, as shown by the boldface lines at recursion levels 3 L, 4 L, 4 R, 4 L, 4 R, 4 L, and 4 R, whenever an empty or null range of addresses has been received i. As shown, tree pointer points to a new root node, and the left and right pointers of many of the nodes within the tree have been modified so that Customer-Name BST is now balanced i. Among other things, the present invention may be embodied as methods or devices. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects.
The following detailed description is, therefore, not to be taken in a limiting sense. In general, the present invention relates to a system and method for generating an ink document structure and storing the ink document structure so that it is accessible by other applications. More specifically, the present invention relates to a system and method for storing data in a serial binary format to increase the efficiency of the storage.
The present invention also relates to a system and method for modifying or altering a portion of an ink without requiring reanalysis of the entire inking. Even though the description set forth herein references storing and loading an ink document, the serial binary format referenced herein may be used to store other types of data. For example, the present invention may include data associated with a word processing application, spreadsheet application, drawing application, graphics application, notes application, picture application or the like.
Succinctly stated, the serial binary format may be used to store any type of data that is associated with a binary tree structure. As general context of one aspect of the present invention, an ink application may provide real-time visual feedback when a user employs a pen to input data.
Inking, however, may include much more than a visualization of pen strokes; it may include a data type. A user may build applications for a digitizer that supports various levels of functionality for pen, ink, ink parsing, and ink recognition.
Such applications range from recognizing simple text input to creating and editing complex ink documents. Ink applications may also include ink-to-text conversion. In some situations, an application may not accept direct ink input. In such a situation, an ink application may implement handwriting recognition and convert the ink to text so that it may be cut and pasted into an application that does not accept direct ink input.
Applications may also recognize ink objects and their context in relation to other document objects. Other embodiments allow a user to manipulate ink and use the ink to author rich documents that contain text, graphics, vector shapes, multimedia objects and the like. Such embodiments handle ink as a data type that has the capability of reflowing and overlying ink objects. Ink inputs may be associated with an application in the form of raw ink data. In one embodiment, the raw ink data may be sent to an ink analyzer to process the raw ink data and generate an ink document structure that may be separate from the raw ink data.
The ink analyzer may implement parsing and recognition processes in order to divide the raw ink data into manageable stroke components. As more fully set forth below, in one embodiment, the ink analyzer may generate an ink document structure having a binary tree where each node of the tree defines a relationship to the parts of the raw ink data.
The ink document structure allows ink applications associated with a platform to relate the raw ink data and the ink document structure in order to load the original ink and the associated ink document structure. The ink document structure also allows a user to load and modify the ink without requiring reanalysis of the entire ink document. Also, the present invention may allow ink to be shared between multiple applications on a platform. FIGURE 3 represents a general overview of one exemplary embodiment of a system for storing ink document data in a serial binary format.
As illustrated, system includes digitizer, application, and ink analyzer. The digitizer, however, may include any device that facilitates the operation of an ink application. Digitizer digitizes user input strokes e. Raw data storage is any type of storage capable of maintaining data from digitizer. In another embodiment, digitizer bypasses raw data storage and transmits the digitized data to application. Application may include any application associated with a platform.
In one embodiment, application is an application that facilitates ink. Application may include a word processing application, a paint application, a drafting application, a drawing application, a credit card signature application or the like. In another embodiment, application is capable of performing a save operation and a load operation. The save operation may include saving ink data and non-ink data. Application may save raw ink data in raw data storage, and application may save an ink document structure to an ink analyzer that is associated with a platform.
During a load operation, application may load and integrate the ink document structure and the raw ink data as will be further set forth below. Ink analyzer may be configured to receive raw ink data from application. Ink analyzer is configured to perform structural analysis on the raw ink data in order to generate an ink document structure.
The structural analysis may include parsing the raw data and recognition of the raw data. In one embodiment, the structural analysis may facilitate text recognition, writing and drawing classification, and layout analysis. Ink analyzer may include a parsing component and a recognizer component that operate in coordination to enhance text recognition.
For example, a parser may perform operations as a pre-processing step before the ink is sent to a recognizer. The pre-processing allows the parser to parse and "clean" multi-lined ink and send it to the recognizer one parcel at a time.
A parcel may include a portion of the ink document. The parser may be further configured to correct incorrect input stroke order information to ensure that all strokes are recognized regardless of the order of input. Also, the parser may generate information about neighboring lines. For example, the fact that two neighboring lines start with a bullet may be a strong indicator that a current line starts with a bullet.
In another embodiment, the parsing operation of ink analyzer may also include classifying ink as a drawing or writing. A writing may include any ink stroke that facilitates a word. A drawing stroke may include anything that is not a writing stroke. In this manner, in accordance with one embodiment, writing strokes may be the only strokes sent to the recognizer.
In yet another embodiment of ink analyzer, the layout analysis includes a breakdown of writing and drawing strokes in relation to one another and to non-ink data.
Once ink analyzer analyzes the strokes of an inking, a tree representation i. Succinctly stated, ink analyzer may include any type of analyzer that is capable of storing a document in a binary tree and making the binary tree accessible to other applications via a serial binary format. Even though the serial binary format is described herein with reference to an ink document structure, the serial binary format may be used to store any type of information associated with a document tree structure.
Once ink analyzer has generated the ink document structure based on the raw data, the ink document structure is made available to application. The ink document structure may include a live ink document structure. When a store operation is instigated, the application requests that ink analyzer store the ink document structure. Because ink analyzer is a platform component, the ink document structure is available to other ink applications.
For example, if a user generates ink in a word processing document, this ink may be cut and pasted to a drawing application without having to be reanalyzed. In this example, the drawing application will understand how to generate the original ink from the ink document structure. Also, since the ink is parsed and saved in a serial binary format (discussed below), the ink may be modified and efficiently stored without requiring the entire ink to be reanalyzed.
The modified portion may correspond to a single parcel of the ink document structure and, therefore, only require a reanalysis of the changed parcel. In general, during a load operation, application may load the raw ink data, non-ink data and the ink document structure. The raw ink data may be loaded from raw data storage. The non-ink data may also be loaded from raw data storage. It is contemplated, however, that the non-ink data is loaded from any storage associated with application. The ink document structure may be loaded from ink analyzer, which may be a platform component.
In one embodiment, application associates the raw data and the ink document structure so that the ink is loaded without requiring reanalysis. Inking may be associated or have relationships with text, drawings, tables, charts and the like.
Also, inking may include various types of writings, drawings, shapes, languages, symbols and skews. As more fully described below, inking may include a plurality of inputs that correlate to a plurality of nodes of an ink document structure.
For example, reference number indicates a writing region. As another example, reference number indicates an alignment level. Illustrated in FIGURE 4, the first and last line of inking are indented to the same level, and therefore, indicate alignment level. The middle line of inking is indented inward, and therefore, indicates another alignment level. In yet another example, reference number indicates a paragraph and reference number indicates a line.
Inking also includes word, and although not shown, word may also include a stroke. A stroke may include a portion of a word. The exemplary ink document structure relates to the exemplary inking. Ink document structure is but one example of an ink document structure.
Any type of tree structure may be implemented that facilitates the representation of a data structure. Ink document structure may also include drawing node, hint node, and one or more links. In that drawing is associated with the words "Mr. Similarly, reference number represents one type of hint. In one embodiment, hint includes a hint box. Hint may indicate that the input will be a number, letter, symbol, structure, code, order or the like.
For example, in FIGURE 4, the hint may include a hint that the input will be a number that is not greater than three digits. Accordingly, an ink analyzer will not mistake the "5" for an "S".
In that hint is associated with the writing "35", hint node may be associated with word node through a link as depicted in FIGURE 5. The above example is for exemplary and descriptive purposes only. In this manner, inking may be represented as ink document structure through nodes.
For example, a stroke node (not shown) may be a child of word node. Word node may be a child of line node, and line node may be a child of paragraph node. Likewise, paragraph node may be a child of alignment level node, and alignment level node may be a child of writing region node. In this manner, root node may contain all the information of its children nodes.
In one embodiment, the entire inking may be represented in reference to root node. Any number of nodes may be associated with any type of document as long as they facilitate the representation of the document in a document tree structure.
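The parent and child relationships described above can be modeled with a small tree type. This is a sketch for orientation only: the class name ContextNode, the kind strings, the use of a child list, and the sample words are all assumptions made for illustration, not the structure the ink analyzer actually produces.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextNode:
    kind: str                      # e.g. "root", "paragraph", "line", "word"
    text: str = ""                 # recognized text, used here for word nodes only
    children: List["ContextNode"] = field(default_factory=list)

    def add(self, child: "ContextNode") -> "ContextNode":
        self.children.append(child)
        return child


def collect_words(node: ContextNode) -> List[str]:
    """Gather word text from a node and all of its descendants, in order."""
    words = [node.text] if node.kind == "word" else []
    for child in node.children:
        words.extend(collect_words(child))
    return words


# Build the chain root -> writing region -> alignment level -> paragraph -> line -> words.
root = ContextNode("root")
region = root.add(ContextNode("writing_region"))
level = region.add(ContextNode("alignment_level"))
paragraph = level.add(ContextNode("paragraph"))
line = paragraph.add(ContextNode("line"))
line.add(ContextNode("word", "hello"))
line.add(ContextNode("word", "ink"))
```

Starting from the root, a single traversal reaches every word, which illustrates how the whole inking can be represented in reference to the root node.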
FIGURE 6 represents one exemplary embodiment for internally storing a document structure in serial binary format. Even though an ink document structure is referenced herein, serial binary format may be used to store any type of tree document structure. When the ink document structure is generated, one or more strings will exist that relate to the document structure.
It is contemplated, however, that the strings may be compressed by any compression format that reduces the size of the strings. FIGURE 6 includes an expanded view of the storage of data of which some data is optionally stored. In one embodiment, storage includes MultiByteEncoded ("MBE") values, which facilitate the storage of unsigned integers to save storage space.
Serial binary data block includes serialized binary data for an ink document and is represented by data blocks. Data blocks represent an expanded view of the whole serial binary data block. Size data may be the first information that is stored in the serial binary data block. Size data includes data associated with the size of the ink document structure. Ink document descriptor data may follow size data. Ink document descriptor data may include any type of data that sets an expectation with regard to the type of data included in serial binary data block. This expectation may be indicated by a set of flags that represent the associated data available in an ink document structure.
The flags may indicate any data that is available in the serial binary data block. Data blocks are but a few examples of data that may be associated with an ink document structure. In one embodiment of the present invention, root node data (further described below) is always associated with a flag in ink document descriptor. Dirty region data is optional data that may not be associated with every ink document structure. Dirty region data refers to data in the ink document structure that is not fully analyzed before saving.
Dirty region data may refer to both ink data and non-ink data such as TextWord, Image and the like. Dirty region data may be indicated by a flag associated with ink document descriptor data. When the ink document descriptor data includes a flag that indicates a dirty region, the flag indicates that the ink document structure has a finite, non-empty dirty region.
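The text does not spell out how an MBE value is laid out. A common way to store unsigned integers compactly is a variable-length encoding that places seven payload bits in each byte and uses the high bit as a continuation marker; the sketch below assumes that scheme purely for illustration, and the function names encode_mbe and decode_mbe are hypothetical.

```python
from typing import Tuple


def encode_mbe(value: int) -> bytes:
    """Encode an unsigned integer, 7 bits per byte, low-order group first (assumed layout)."""
    if value < 0:
        raise ValueError("only unsigned integers can be encoded")
    out = bytearray()
    while True:
        low = value & 0x7F
        value >>= 7
        if value:
            out.append(low | 0x80)   # high bit set: more bytes follow
        else:
            out.append(low)          # high bit clear: last byte
            return bytes(out)


def decode_mbe(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one value starting at offset; return (value, offset after the value)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7
```

Under this assumed layout, values below 128 occupy a single byte, so small counts, sizes, and indices cost far less than a fixed four-byte integer.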
If dirty region data exists, this data may be represented as a series of rectangles, which are stored in a binary format to facilitate the recreation of the dirty region.
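A series of rectangles stored in binary form can be sketched as a count followed by one fixed-size record per rectangle. The region data format in this description lists top, left, width, and height for every rectangle; everything else in the sketch is an assumption, including the leading count, the 32-bit little-endian fields, and the names pack_region and unpack_region.

```python
import struct
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (top, left, width, height)


def pack_region(rects: List[Rect]) -> bytes:
    """Serialize a region as a rectangle count followed by the rectangles."""
    payload = struct.pack("<I", len(rects))            # assumed 32-bit count
    for top, left, width, height in rects:
        payload += struct.pack("<4i", top, left, width, height)
    return payload


def unpack_region(data: bytes) -> List[Rect]:
    """Rebuild the list of rectangles from the packed form."""
    (count,) = struct.unpack_from("<I", data, 0)
    rects = []
    offset = 4
    for _ in range(count):
        rects.append(struct.unpack_from("<4i", data, offset))
        offset += 16                                    # four 4-byte fields
    return rects


# Hypothetical dirty region made of two rectangles.
dirty = [(10, 20, 300, 40), (60, 20, 150, 40)]
blob = pack_region(dirty)
```

Storing the count first lets a reader recreate the region without scanning ahead; a real format could replace the fixed-width fields with a more compact integer encoding.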
In the situation where the ink document is fully analyzed, the dirty region data may not be present and not require a flag in ink document descriptor. In one embodiment, dirty region data (if present) immediately follows ink document descriptor data. In one embodiment, dirty region data is stored to the serial binary data block as region data.
Region data format may be used for storing dirty region data, location data for non-ink leaf context nodes or location for hint nodes. Region data may include an array of individual rectangles that define the whole area of region data. In order to properly reconstruct a region data object e. For every rectangle, the region data may include information regarding top data, left data, width data, and height data. One example of a representation of persisted region data is as follows: The ink document structure or any individual node in the document tree structure may contain arbitrary data that is identified by a GUID.
The arbitrary data may include known data types and data types that are associated with a particular application. For data that is associated with a particular application i. GUID table data corresponding to any custom property data at the ink document level or context node level and may be subsequently referred to via MBE, zero-based indices in relation to GUID table data. As an example, non-predefined GUIDs may include application specific extended node types and application specific extended properties on nodes.
In the situation where GUID table data is present in relation to serial binary data block, the presence is identified by a flag that is related to the document descriptor data. Likewise, if GUID table data is not present, a flag is not set in ink document descriptor data. One example of a representation of persisted GUID table data is as follows: String table data is optional data that may not be associated with every ink document structure. String table data may be associated with analysis hint suffix data, prefix text data, factoid data, hint name data, word list data, custom node link data, and recognized string data.
With regard to one aspect of the invention, string table data may include duplications. In so far as the ink document structure is loaded in a particular sequence, maintaining an index to string table data allows loading of the appropriate string data from string table data. An index may not be written every time a string is associated with string table data. In such a situation, at least one byte per instance is saved.
Moreover, the strings in string table data may be LZW compressed. By not writing an index for every string in combination with LZW compression, the size of the string may be substantially reduced. In the situation where string table data is present in relation to serial binary data block, the presence is identified by a flag that is related to ink document descriptor data. Likewise, if string table data is not present, a flag is not set in ink document descriptor data. One example of a representation of persisted string table data is as follows:
In one aspect, root node data is mandatory data that is associated with every ink document structure even if root node data is empty. A flag associated with ink document descriptor data may indicate the presence of root node data. Link data is optional data that may not be associated with every ink document structure. Link data includes data that indicates whether or not any nodes of the ink document structure are linked to other nodes in the same ink document structure.
Link data may be maintained globally in association with the ink document structure. In storing link data, link data may include a count of the number of links associated with the ink document structure. Individual link data may also include the MBE size of the data. In one aspect, the MBE size data is followed by a link descriptor, which identifies the type of link and origin information. The source node index and the destination node index identify the source node and destination node, respectively.
In yet another aspect, if the link descriptor data indicates that link data includes a custom link, the custom link data is read from a global string table that is identified by an index in the global string table.
In the situation where link data is present in relation to serial binary data block, the presence is identified by a flag that is related to the ink document descriptor data. Likewise, if link data is not present, a flag is not set in ink document descriptor data. One example of a representation of persisted link data is as follows: Custom property data is optional data that may not be associated with every ink document structure.
Custom property data may be associated with the ink document structure, and in one aspect, is stored as custom property data associated with a node. Custom property data may include any arbitrary data that an application associates with a node. In storing custom property data, a flag may identify custom property data as a known value. In another aspect, storing custom property data includes an index to GUID table data. The storage of custom property data may also include the MBE value of the size of the data and an array of bytes that represent the data.
In the situation where custom property data is present in relation to serial binary data block, the presence is identified by a flag that is related to ink document descriptor data. Likewise, if custom property data is not present, a flag is not set in ink document descriptor data. One example of a representation of persisted ink document structure is as follows: In one embodiment, root node data is a context node and stored as context node data. Context node data may be included in the serialized binary data for an ink document and is represented by data blocks. Data blocks represent an expanded view of context node data. Node descriptor data may include data that is associated with each node of an ink document structure.
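The pattern of announcing each optional block through a descriptor flag can be sketched with a bit-flag type. The flag names, bit positions, and helper functions below are hypothetical; the source states only that each block's presence is identified by a flag and that, in one embodiment, root node data is always flagged.

```python
from enum import IntFlag
from typing import List


class DescriptorFlags(IntFlag):
    # Bit assignments are assumptions made for this sketch.
    ROOT_NODE = 0x01
    DIRTY_REGION = 0x02
    GUID_TABLE = 0x04
    STRING_TABLE = 0x08
    LINKS = 0x10
    CUSTOM_PROPERTIES = 0x20


def make_descriptor(dirty_region: bool = False, guid_table: bool = False,
                    string_table: bool = False, links: bool = False,
                    custom_properties: bool = False) -> DescriptorFlags:
    """Set one flag per block that will be written; the root node flag is always set."""
    flags = DescriptorFlags.ROOT_NODE
    if dirty_region:
        flags |= DescriptorFlags.DIRTY_REGION
    if guid_table:
        flags |= DescriptorFlags.GUID_TABLE
    if string_table:
        flags |= DescriptorFlags.STRING_TABLE
    if links:
        flags |= DescriptorFlags.LINKS
    if custom_properties:
        flags |= DescriptorFlags.CUSTOM_PROPERTIES
    return flags


def blocks_present(flags: DescriptorFlags) -> List[str]:
    """List the names of the blocks a reader should expect, in declaration order."""
    return [flag.name for flag in DescriptorFlags if flag in flags]


# A fully analyzed document with strings and links but no dirty region.
descriptor = make_descriptor(string_table=True, links=True)
```

A reader that checks the descriptor first knows which blocks to expect and can skip parsing for any block whose flag is not set.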
Node descriptor data may be indicated by a collection of flags that define the configuration of the node data as well as the types of nodes associated with the ink document structure.
Node size data may include possible known properties that are stored on a particular node e. In one aspect, node size data may immediately follow node descriptor data. Succinctly stated, node descriptor data may indicate the size of the entire context node tree. Node location data is optional data that may not be associated with every node type. In the situation where node location data is present, the presence is identified by a flag that is related to node descriptor data. Likewise, if node location data is not present, a flag is not set in node descriptor data. In one aspect, if node descriptor data indicates a non-ink leaf node, node location data may follow.