+

WO1997011427A1 - Procede et equipement donnant une structure de fichiers cd-rom a acces rapide - Google Patents

Procede et equipement donnant une structure de fichiers cd-rom a acces rapide Download PDF

Info

Publication number
WO1997011427A1
WO1997011427A1 PCT/US1996/015228 US9615228W WO9711427A1 WO 1997011427 A1 WO1997011427 A1 WO 1997011427A1 US 9615228 W US9615228 W US 9615228W WO 9711427 A1 WO9711427 A1 WO 9711427A1
Authority
WO
WIPO (PCT)
Prior art keywords
index
blocks
record
records
cdrom
Prior art date
Application number
PCT/US1996/015228
Other languages
English (en)
Inventor
Bruce C. Atherton
Original Assignee
International Compu Research, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Compu Research, Inc. filed Critical International Compu Research, Inc.
Publication of WO1997011427A1 publication Critical patent/WO1997011427A1/fr

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/102Programmed access in sequence to addressed parts of tracks of operating record carriers
    • G11B27/105Programmed access in sequence to addressed parts of tracks of operating record carriers of operating discs
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • G11B27/32Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording on separate auxiliary tracks of the same or an auxiliary record carrier
    • G11B27/327Table of contents
    • G11B27/329Table of contents on a disc [VTOC]
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B7/00Recording or reproducing by optical means, e.g. recording using a thermal beam of optical radiation by modifying optical properties or the physical structure, reproducing using an optical beam at lower power by sensing optical properties; Record carriers therefor
    • G11B7/08Disposition or mounting of heads or light sources relatively to record carriers
    • G11B7/085Disposition or mounting of heads or light sources relatively to record carriers with provision for moving the light beam into, or out of, its operative position or across tracks, otherwise than during the transducing operation, e.g. for adjustment or preliminary positioning or track change or selection
    • G11B7/08505Methods for track change, selection or preliminary positioning by moving the head
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B2220/00Record carriers by type
    • G11B2220/20Disc-shaped record carriers
    • G11B2220/21Disc-shaped record carriers characterised in that the disc is of read-only, rewritable, or recordable type
    • G11B2220/213Read-only discs
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B2220/00Record carriers by type
    • G11B2220/20Disc-shaped record carriers
    • G11B2220/25Disc-shaped record carriers characterised in that the disc is based on a specific recording technology
    • G11B2220/2537Optical discs
    • G11B2220/2545CDs

Definitions

  • This invention relates to computer storage devices, and more particularly to a method and apparatus for a fast access CDROM file structure for indexed databases.
  • an inverted index for a full-text database provides a pointer or set of pointers to every significant term in the database (high frequency "noise” words, like “the”, “it”, etc., are sometimes omitted). Accordingly, it is common to use an inverted index in conjunction with a text database to permit easy access to specific documents and subportions of documents pointed to by the index.
  • An inverted index includes a lookup table or initial index to locate selected entries in the inverted index.
  • the initial index typically is a sorted table of counts and offsets referencing the inverted index.
  • the offset defines a location in the inverted index at which a desired term is referenced, while the count defines the number of such terms in sequence.
  • the inverted index itself typically includes a document or reference number, and a vector pointing to the location in the database of the actual document or text containing the selected term.
  • FIGURE 1 is a block diagram showing the logical structure of a typical prior art inverted index and initial index in the context of a database storing legal case law documents.
  • the original text in this example, legal cases
  • index i.e., noise words and any desired special terms are ignored
  • the result of such a search typically comprises a document identifier and a vector pointing to the original text for each significant term (a vector may point to the actual location within a document of a significant term).
  • the inverted index comprises a data file 10 typically having fixed length records or entries 11 , as shown in FIGURE 1.
  • Each inverted index record 11 is shown in FIGURE 1 with a record address, shown as the first number of the pair of numbers in the DOC column, and content, shown in parentheses in the DOC column.
  • the record address is not normally stored with the record, but is only a referent to the content of the record.
  • record #121 has a document identifier content of 1011 in the example shown).
  • Each record 11 of the inverted index data file 10 stores a document identifier and an associated vector, as determined in the index generation process.
  • the initial index also comprises a data file 15 typically having fixed length records or entries 16.
  • Each initial index record 16 typically has two values, a count value and an offset value.
  • the offset defines a location in the inverted index at which a desired term is referenced, while the count defines the number of such terms in sequence.
  • Each initial index record 16 is shown in FIGURE 1 with a record address, shown as the first number ofthe pair of numbers in the COUNT column, and content, shown in parentheses in the COUNT column. Again, the record address is not normally stored with the record, but is only a referent to the record. Thus, record #21 has a count of 4 in the example shown).
  • records 11 for document identifiers that point to documents containing the same terms are typically grouped together. For example, if four cases contain the term "LARCENY", the records 11 for those four cases are grouped together in the data file 10 comprising the inverted index. A record 16 would then be created in the initial index data file 15 with an offset value pointing to the first such inverted index record 11, and a count value indicating the number (four) of sequential related inverted index records 11.
  • a search term or query is entered by a user.
  • a user wants to find every case in a legal database containing the term "LARCENY".
  • the user's input can be parsed against a lookup table or query dictionary, to find a value associated with the selected term.
  • Such a query dictionary 20 is shown in FIGURE 1.
  • "LARCENY" is associated with the query number 21.
  • the query number is used as a pointer to a corresponding record address in the initial index data file 15.
  • entry of "LARCENY” points to an initial index record 16 that indicates four relevant inverted index records 11 are located beginning at address 121 of the inverted index data file 10.
  • Other inverted index file structures are possible, but most have characteristics and functionality similar to that described above.
  • CDROM 's Compact Disc Read Only Memory's
  • CDROM 's are capable of storing large amounts of data, and thus are frequently used for storing such databases and associated index files.
  • each index is stored as a separate file each comprising a plurality of physically contiguous blocks or clusters.
  • FIGURE 2 is a stylized diagram of the physical layout on a CDROM disk 40 of a typical prior art inverted index data file 10 and initial index data file 15.
  • Two "tracks" 41 , 42 of data on the CDROM disk 40 are shown, as are two blocks 43, 44 defining regions for storing data (in actuality, a CDROM uses a single spiral track to store data, but stored data on the spiral track is accessible as blocks in a manner similar to magnetic disk drives).
  • the inverted index data file 10 is stored contiguously (for example, in track 41) while the initial index data file 15 is stored contiguously (for example, in track 42) but separate and apart from the inverted index data file 10.
  • the corresponding text database would also be stored on the CDROM disk 40.
  • CDROM systems use a variable speed motor to maintain constant linear velocity along the single spiral track of data.
  • the motor must speed up or slow down if the transducer head is moved in or out (known as a "lateral seek") along a radius of the disk 40.
  • Block identification along a track is slow because of the lack of unique tracks, and so "longitudinal seeks" along a track (i.e., from block to block) also take time.
  • longitudinal seeks are much faster than lateral seeks, and are typically determined by the rotational latency of disk, which is constant due to the constant linear velocity characteristic of CDROM systems.
  • a "double speed" CDROM system has approximately one-half of the rotational latency of a "single speed” CDROM system).
  • the transducer mass of a CDROM drive is relatively large. Therefore, average seek times across a CDROM are very long (typically 20 to 40 times) compared to average seek times for magnetic disk drives or magneto-optical drives. Further, average lateral seek time is typically many times the average longitudinal seek time for a typical "double speed" CDROM system.
  • LARCENY a first CDROM lateral disk seek and one disk read are required to access the corresponding record 16 in the initial index data file 15.
  • the offset field in the retrieved record 16 is then used as a pointer to the inverted index data file 10, generally requiring a second lateral disk seek.
  • the number of disk reads from the inverted index data file 10 depends on the count value in the retrieved initial index record 16, but is a minimum of one.
  • the present invention provides such an improvement for an inverted indexed database stored on a CDROM.
  • the present invention is a method and apparatus for a fast access CDROM file struc ⁇ ture for indexed databases.
  • the invention takes advantage of the sequential, multi- step look-up process for inverted indexes by physically interleaving an initial index with its corresponding inverted index on a CDROM disk.
  • a speed improvement in accessing the indexes is realized by ensuring that the second seek in the process described above is almost always a longitudinal seek to a block following the block read after the first seek.
  • lateral seeks are substantially reduced, thereby reducing average seek time.
  • FIGURE 1 is a block diagram showing the logical structure of a typical prior art inverted index and initial index in the context of a database storing legal case law documents.
  • FIGURE 2 is a stylized diagram of the physical layout on a CDROM disk of a typical prior art inverted index data file and initial index data file.
  • FIGURE 3 is a stylized diagram of the physical layout on a CDROM disk of an inverted index data file and initial index data file in accordance with the present invention.
  • FIGURE 3 is a stylized diagram of the physical layout on a CDROM disk 40 of an inverted index data file and initial index data file in accordance with the present invention. Two “tracks" 41 , 42 of data on the CDROM disk 40 are shown.
  • blocks from each file are physically interleaved longitudinally (i.e., along) the spiral track (or, if CDROM technology changes, along concentric tracks).
  • blocks 43a and 43b are used to store the inverted index data file 10
  • blocks 44a and 44b are used to store the initial index data file 15.
  • the CDROM disk 40 is readable by a conventional CDROM drive having a drive motor, transducer head, and read and control circuitry.
  • a speed improvement in accessing the indexes is realized by ensuring that the second of the two required seeks in the access process is almost always a longitudinal seek to a second data block following a first data block read after the first seek (which may be a lateral seek). (In some circumstances, a lateral seek may be required to access the second block, but the interleave of blocks means that most probably such would not be the case.) Thus, lateral seeks are substantially reduced, thereby reducing average seek time.
  • the user enters a query (e.g., "LARCENY"), which is parsed into a query number (21).
  • a query e.g., "LARCENY”
  • the query number is used to initiate a first seek (which may be a lateral seek) to a block in the initial index data file 15.
  • a first seek which may be a lateral seek
  • An initial index record 16 (record 21) corresponding to the query number (21) is read from the block to obtain a count (4) and an offset pointer (121) to the inverted index data file 10.
  • the offset value is used to initiate a second seek to a block in the inverted index data file 10. If the inverted index record 11 pointed to by the offset pointer is physically located in the next or a nearby (e.g. , same circumferential track) block after the block containing the read initial index record 16, in accordance with the present invention, only a longitudinal seek is required.
  • each index occupies the same number of blocks
  • the number of blocks on a circumferential "track" (taking into account the spiral nature of a CDROM disk track) dedicated to one index may be larger or smaller than the number of blocks dedicated to the other index.
  • the initial index data file 15 comprises numerous small records
  • the inverted index data file 10 comprises fewer larger records
  • this approach attempts to match the number of inverted index data file 10 blocks with the number of initial index data file 15 blocks containing records 16 with offset pointers corresponding to such inverted index data file 10 blocks.
  • 100 initial index data file 15 blocks contain all of the offset pointers to 10,000 inverted index data file 10 blocks, on average, one initial index data file 15 block would be interleaved for every ten inverted index data file 10 blocks.
  • the ratio of inverted index blocks to initial index blocks need not be constant, as were the underlying database contains a high frequency of a certain significant terms compared to the frequency of other significant terms.
  • one particular initial index block may contain records pointing to a large number of inverted index blocks having few distinct terms (i.e., the count value is high, indicating numerous contiguous inverted index records 11).
  • another particular initial index block may contain records pointing to a small number of inverted index blocks having many distinct terms (i.e., the count value is low, indicating few contiguous related inverted index records 11). Accordingly, by mapping initial index block records to inverted index records, a "custom" interleave for a particular database can be generated.
  • the distribution of initial index data file 15 blocks to inverted index data file 10 blocks is empirically determined on a case-by-case basis, taking into consideration the size of the two files, the size of the records in each file, and the distribution of inverted index records 11.
  • the records 11 of each inverted index data file 10 block should be distributed on the CDROM closely following the initial index data file 15 block containing the associated pointers for such records 11.
  • Other algorithms can be devised for determining the "distribution frequency" of initial index data file 15 blocks to inverted index data file 10 blocks, but all come within the scope of the inventive concept described above.
  • the blocks 43a, 43b, 44a, 44b can be stored on a magnetic disk and fetched in the selected interleave order for writing to a recordable CDROM or to a writable master CDROM for mass-production.
  • the blocks 43a, 43b, 44a, 44b can be stored on a magnetic disk in the selected interleave order and then written as a single data stream to a recordable CDROM or to a writable master CDROM for mass-production.
  • the invention takes advantage of the sequential, multi-step look-up process for inverted indexes by physically interleaving an initial index with its corresponding inverted index on a CDROM disk.
  • a speed improvement in accessing the indexes is realized by ensuring that the second of two required seeks in the process is almost always a longitudinal seek to a data block following a data block read after the first seek.
  • lateral seeks are substantially reduced, thereby reducing average seek time.
  • the invention is also generally applicable to any similar arrangement of data on a
  • CDROM or similar device, where a first seek to and read of records of one file is followed by a seek to the records of a second file.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

La présente invention concerne un procédé et un équipement donnant une structure de fichiers CD-ROM à accès rapide, pour les bases de données indexées. L'invention tire parti d'un procédé de consultation multipas séquentiel pour index inversés en imbriquant matériellement un index initial avec son index inversé correspondant sur un disque CD-ROM. On améliore la vitesse, dans l'accès aux index, en faisant en sorte que la deuxième de deux recherches demandées, dans le procédé, soit presque toujours une recherche longitudinale allant vers un bloc de données qui suit un bloc de données lu après la première recherche. Ainsi, le nombre de recherches latérales est réduit dans une mesure substantielle, ce qui réduit le temps de recherche moyen.
PCT/US1996/015228 1995-09-22 1996-09-23 Procede et equipement donnant une structure de fichiers cd-rom a acces rapide WO1997011427A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US53227295A 1995-09-22 1995-09-22
US08/532,272 1995-09-22

Publications (1)

Publication Number Publication Date
WO1997011427A1 true WO1997011427A1 (fr) 1997-03-27

Family

ID=24121082

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1996/015228 WO1997011427A1 (fr) 1995-09-22 1996-09-23 Procede et equipement donnant une structure de fichiers cd-rom a acces rapide

Country Status (1)

Country Link
WO (1) WO1997011427A1 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4687604A (en) * 1985-09-17 1987-08-18 Goettl Adam D Floor pan for evaporative coolers
US5561649A (en) * 1992-12-17 1996-10-01 Samsung Electronics Co, Ltd. Disk recording medium and reproduction method and apparatus thereof

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4687604A (en) * 1985-09-17 1987-08-18 Goettl Adam D Floor pan for evaporative coolers
US5561649A (en) * 1992-12-17 1996-10-01 Samsung Electronics Co, Ltd. Disk recording medium and reproduction method and apparatus thereof

Similar Documents

Publication Publication Date Title
US4991087A (en) Method of using signature subsets for indexing a textual database
US5375235A (en) Method of indexing keywords for searching in a database recorded on an information recording medium
US5664184A (en) Method and apparatus for implementing Q-trees
US6014733A (en) Method and system for creating a perfect hash using an offset table
US6725223B2 (en) Storage format for encoded vector indexes
US6658437B1 (en) System and method for data space allocation using optimized bit representation
US5734892A (en) Efficient method and apparatus for access and storage of compressed data
JPH09212528A (ja) データベースを記憶する方法、データベースからレコードを検索する方法、および、データベース記憶/検索システム
US7231383B2 (en) Search engine for large-width data
Baeza-Yates et al. Hierarchies of indices for text searching
US5608905A (en) DOS and Macintosh preformatted computer storage media
WO1997011427A1 (fr) Procede et equipement donnant une structure de fichiers cd-rom a acces rapide
Zobel et al. Storage Management for Files of Dynamic Records.
JPS63104284A (ja) デイスクフアイルアクセス方式
CN104834664A (zh) 面向光盘库的全文检索系统
KR100489412B1 (ko) 외부기억장치의데이타출력방법
JP3145727B2 (ja) データの検索装置
JP3348279B2 (ja) プライスルックアップデータ検索回路及びその検索方法並びにその制御プログラムを記録した記録媒体
JPH06295313A (ja) インデックス付き検索ファイルのデータ検索装置
CN117290390A (zh) 一种基于特殊索引内存映射在大数据检索上的方法
JPS59183450A (ja) フアイル管理方式
Pramanik et al. HCB_tree: a height compressed B_tree for parallel processing
JPS61103242A (ja) 高速検索方式
JPS63184960A (ja) デ−タ管理方式
JPH04276836A (ja) 可変長データの記録方法とその検索方法

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载