EP1611528A2

EP1611528A2 - Method and device for data processing

Info

Publication number: EP1611528A2
Application number: EP04725695A
Authority: EP
Inventors: Martin Vorbach
Original assignee: PACT XPP Technologies AG
Current assignee: PACT XPP Technologies AG
Priority date: 2003-04-04
Filing date: 2004-04-05
Publication date: 2006-01-04
Also published as: WO2004088502A2; US20100122064A1; WO2004088502A3; US20070011433A1; JP2006524850A; DE112004000026D2

Abstract

The invention relates to a data processing device with a data processing logic cell field and at least one sequential CPU, wherein a coupling of the sequential CPU to the data processing logic cell field, for data exchange, particularly in block form, by means of lines leading to a cache memory is provided.

Description

Title: Method and device for data processing


The present invention relates to the subject matter claimed in the preamble and thus deals with improvements in the use of reconfigurable processor technologies for data processing.

With regard to the preferred structure of logic cell fields, reference is made to the XPP architecture and to the previously published and more recent patent applications of the present applicant, which are fully incorporated here for the purposes of disclosure. The following should therefore be mentioned in particular: DE 44 16 881 A1, DE 197 81 412 A1, DE 197 81 483 A1, DE 196 54 846 A1, DE 196 54 593 A1, DE 197 04 044.6 A1, DE 198 80 129 A1, DE 198 61 088 A1, DE 199 80 312 A1, PCT/DE 00/01869, DE 100 36 627 A1, DE 100 28 397 A1, DE 101 10 530 A1, DE 101 11 014 A1, PCT/EP 00/10516, EP 01 102 674 A1, DE 198 80 128 A1, DE 101 39 170 A1, DE 198 09 640 A1, DE 199 26 538.0 A1, DE 100 50 442 A1, PCT/EP 02/02398, DE 102 40 000, DE 102 02 044, DE 102 02 175, DE 101 29 237, DE 101 42 904, DE 101 35 210, EP 01 129 923, PCT/EP 02/10084, DE 102 12 622, DE 102 36 271, DE 102 12 621, EP 02 009 868, DE 102 36 272, DE 102 41 812, DE 102 36 269, DE 102 43 322, EP 02 022 692, as well as EP 02 001 331 and EP 02 027 277.

A problem with conventional approaches to reconfigurable technologies arises when data processing is to be carried out primarily on a sequential CPU with the aid of a configurable data processing logic cell array or the like, and/or when data processing is desired in which many and/or extensive processing steps have to be executed sequentially.

Approaches are known which deal with how data processing can take place both on a configurable data processing logic cell array and on a CPU.

For example, WO 00/49496 discloses a method for executing a computer program with a processor which comprises a configurable functional unit capable of executing reconfigurable instructions, the effect of which can be redefined at runtime by loading a configuration program. The method comprises the steps of selecting combinations of reconfigurable instructions, generating a respective configuration program for each combination, and executing the computer program.

Each time an instruction from one of the combinations is needed during execution and the configurable functional unit is not configured with the configuration program for this combination, the configuration program for all of the instructions of the combination is to be loaded into the configurable functional unit. Furthermore, WO 02/50665 A1 discloses a data processing device with a configurable functional unit, the configurable functional unit serving to execute an instruction according to a configurable function. The configurable functional unit has a large number of independent configurable logic blocks for performing programmable logic operations in order to implement the configurable function. Configurable connection circuits are provided between the configurable logic blocks and both the inputs and the outputs of the configurable functional unit. This allows the distribution of logic functions over the configurable logic blocks to be optimized.

A problem with conventional architectures also arises when a coupling is to be provided and/or when technologies such as data streaming, hyperthreading, multithreading and so on are to be used in a meaningful and performance-enhancing manner. A description of one such architecture can be found in "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor", Dean M. Tullsen, Susan J. Eggers et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, May 1996.

The hyperthreading and multithreading technologies were developed in view of the fact that modern microprocessors derive their performance from many specialized, deeply pipelined functional units and from deep memory hierarchies, which allows high frequencies in the functional cores. Because of the strictly hierarchical memory arrangements, cache misses are very costly owing to the difference between core and memory frequencies, since many core clock cycles pass before data are read from memory. In addition, problems arise with branches, in particular with incorrectly predicted branches. It has therefore been proposed to use so-called SMT (simultaneous multithreading) methods to switch between different tasks whenever an instruction cannot be executed or does not use all functional units.

The technology of the third-party documents cited by way of example shows, for instance, an arrangement in which configurations can indeed be loaded into a configurable data processing logic cell field, but in which the data exchange between the ALU of the CPU and the configurable data processing logic cell field, be it an FPGA or the like, takes place via registers. In other words, data from a data stream must first be written sequentially into registers and then stored sequentially back out of them. There is also a problem if data are to be accessed externally, because even then there are problems with the timing of the data processing compared with the ALU, with the assignment of configurations, and so on. The conventional arrangements known from the third-party documents are used, among other things, to process functions in the configurable data processing logic cell array, DFP, FPGA or the like that cannot be processed efficiently on the CPU's own ALU. The configurable data processing logic cell array is thus used in practice to enable user-defined opcodes that allow algorithms to be processed more efficiently than would be possible on the ALU of the CPU without support from the configurable data processing logic cell array.

In the prior art, as has been recognized, the coupling is therefore usually word-based and not block-based, as would be necessary for data flow processing. It is desirable, first of all, to enable more efficient data processing than is possible with a close coupling via registers.

Another possibility for using logic cell fields made up of coarse-grained and/or fine-grained logic cells and logic cell elements consists in a very loose coupling of such a field to a conventional CPU and/or a CPU core in embedded systems. Here, a conventional sequential program, for example one written in C, C++ or the like, can run on a CPU or the like, with calls to data stream processing on the fine-grained and/or coarse-grained data processing logic cell field being instantiated from it. The problem then is that, when programming for this logic cell field, a program that is not written in C or another sequential high-level language must be provided for the data stream processing. It would be desirable here for C programs or the like to be executable both on the conventional CPU architecture and on a data processing logic cell field operated together with it, that is to say for data flow capability to be retained, in particular with the data processing logic cell field, even in quasi-sequential program execution, while at the same time CPU operation remains possible with a coupling that is not too loose.

It is also already known to provide sequential data processing within a data processing logic cell field arrangement, as known in particular from PACT02 (DE 196 51 075.9-53, WO 98/26356), PACT04 (DE 196 54 846.2-53, WO 98/29952), PACT08 (DE 197 04 728.9, WO 98/35299), PACT13 (DE 199 26 538.0, WO 00/77652) and PACT31 (DE 102 12 621.6-53, PCT/EP 02/10572). In this case, however, sequential execution of parts is achieved within a single configuration, for example to save resources or to optimize timing, without this already enabling a programmer to map a piece of high-level language code onto a data processing logic cell field automatically and easily, as is the case with conventional machine models for sequential processors. The implementation of high-level language code on data processing logic cell fields according to the principles of models for sequentially operating machines remains difficult.

It is also known from the prior art that a plurality of configurations, each of which causes a different mode of operation of parts of the array, can be processed simultaneously on the processor field (PA), and that one or more of the configurations can be changed at runtime without disturbing the others. Methods, and means implemented in hardware for carrying them out, are known with which it can be ensured that partial configurations to be loaded onto the field can be executed without deadlock. In this regard, reference is made in particular to the applications relating to the FILMO technology, PACT05 (DE 196 54 593.5-53, WO 98/31102), PACT10 (DE 198 07 872.2, WO 99/44147, WO 99/44120), PACT13 (DE 199 26 538.0, WO 00/77652), PACT17 (DE 100 28 397.7, WO 02/13000) and PACT31 (DE 102 12 621.6, WO 03/036507). This technology already allows parallelization in a certain way and, with appropriate design and assignment of configurations, a kind of multitasking/multithreading, specifically in such a way that planning, that is, scheduling and/or time-use planning control, is provided. Time-use planning control means and methods are therefore known per se from the prior art which, at least with a corresponding assignment of configurations to individual tasks and/or of threads to configurations and/or configuration sequences, permit multitasking and/or multithreading. The use of such time-use planning control means, which were used in the prior art for configuration and/or configuration management, for the purpose of scheduling tasks, threads, multithreads and hyperthreads is regarded as inventive per se.

It is also desirable, at least according to one aspect and in preferred variants, for a semiconductor architecture to be able to support modern data processing and program execution technologies such as multitasking, multithreading and hyperthreading.

The object of the invention is to provide something new for commercial application.

The solution to this object is claimed in independent form. Preferred embodiments can be found in the subclaims.

A first essential aspect of the present invention is therefore to be seen in the fact that data are supplied to the data processing logic cell array in response to the execution of a LOAD configuration by the data processing logic cell array, and/or that data are written away (STORE) from this data processing logic cell array by processing a STORE configuration accordingly. These load configurations and/or store configurations are preferably designed in such a way that the addresses of those memory locations which are to be accessed, directly or indirectly, for loading and/or storing are generated directly or indirectly within the data processing logic cell field. Configuring address generators within a configuration in this way makes it possible to load a large amount of data into the data processing logic cell field, where the data can be stored in internal memories (iRAM) and/or in internal cells such as EALUs with registers and/or the like. The load or store configuration thus enables data to be loaded block by block, almost in the manner of a data stream and in particular quickly compared with individual accesses. Such a load configuration can be executed before one or more configurations that actually evaluate and/or modify the data and with which the preloaded data are processed. In large logic cell fields, loading and/or writing data can typically take place in small parts of the field while other parts are occupied with other tasks. With regard to this and other special features of the invention, reference is made to FIG. 1.

In the ping-pong-type data processing described in other published documents of the applicant, memory cells are provided on both sides of a data processing field. In a first processing step the data flow from the memory on one side through the data processing field to the memory on the other side, where the intermediate results obtained during this first pass through the field are stored in the second memory; the field is reconfigured if necessary, and the intermediate results then flow back for further processing, and so on. In this scheme, one memory side can be preloaded with new data by means of a LOAD configuration in one part of the array, while data are written away from the opposite memory side by a STORE configuration in another part of the array. This simultaneous LOAD/STORE procedure is also possible without spatial separation of the memory areas.
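
The ping-pong flow between two memory sides can be pictured with a minimal software sketch. This is only an illustration under simplifying assumptions: the function name `ping_pong`, the two-element `banks` list and the use of plain Python functions to stand in for configurations are hypothetical and do not appear in the original text.

```python
def ping_pong(data, stages):
    """Run one stage (standing in for one configuration) per pass over the data,
    alternating the direction of flow between two memory banks."""
    banks = [list(data), [None] * len(data)]  # bank 0 plays the preloaded (LOAD) side
    src = 0
    for stage in stages:
        dst = 1 - src
        # The field streams every word from the source bank into the other bank.
        banks[dst] = [stage(word) for word in banks[src]]
        src = dst  # after "reconfiguration" the roles of the two banks swap
    return banks[src]  # a STORE step would write this bank back to memory


# Two passes: add one on the way over, double on the way back.
result = ping_pong([1, 2, 3], [lambda x: x + 1, lambda x: x * 2])
```

In hardware the LOAD and STORE steps would run in other parts of the array at the same time as the processing pass; the sketch only shows the alternation of source and destination.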

It should be mentioned again that there are various ways of filling internal memories with data. The internal memories can in particular be preloaded by separate load configurations using data-stream-like access. This corresponds to use as vector registers, with the consequence that the internal memories will always be at least partially part of the externally visible state of the XPP and must therefore be saved or written back on a context switch. Alternatively and/or additionally, the internal memories (iRAMs) can be loaded by separate load instructions on the CPU. This reduces the loading carried out by configurations and can provide a broader interface to the memory hierarchy.

The preload can also consist of a burst load from memory, triggered by an instruction via the cache controller. Furthermore it is possible, and in many cases preferred as particularly powerful, to design the cache in such a way that a specific preload instruction maps a specific memory area, defined by start address and size or step size(s), onto the internal memory (IRAM). When all internal RAMs are allocated, the next configuration can be activated. Activation entails waiting until all burst-like loading processes have been completed. However, this is transparent provided the preload instructions are issued early enough and cache locality is not destroyed by interrupts or task switches. In particular, a "preload clean" instruction can then be used, which avoids loading data from memory.

A synchronization instruction is required to ensure that the content of a specific memory area that is cached in the IRAM can be written back to the memory hierarchy. This can be done globally or by specifying the memory area to be accessed; the global access corresponds to a "full write back". To simplify preloading of the IRAM, it is possible to specify just a base address, possibly one or more step sizes (when accessing multidimensional data fields) and an overall run length, to store these in registers or the like, and then to access these registers in order to determine how loading is to take place.
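
How a base address and step sizes can describe a multidimensional preload is sketched below. The sketch is an assumption-laden illustration: the function name `preload_addresses` and the per-dimension `counts` parameter are hypothetical (the text itself speaks only of a base address, step sizes and an overall run length, which here equals the product of the counts).

```python
def preload_addresses(base, steps, counts):
    """Yield the addresses of a multidimensional access pattern.

    steps[i] and counts[i] describe dimension i, outermost dimension first.
    """
    def walk(dim, addr):
        if dim == len(steps):
            yield addr
            return
        for i in range(counts[dim]):
            yield from walk(dim + 1, addr + i * steps[dim])

    yield from walk(0, base)


# Two rows of three words each, taken from an array with a row pitch of 16 words.
addrs = list(preload_addresses(base=0x100, steps=[16, 1], counts=[2, 3]))
run_length = len(addrs)
```

Storing `base`, `steps` and the run length in registers, as the text suggests, is enough for an address generator to reproduce this sequence without further instructions.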

It is particularly preferred if the registers are designed as FIFOs. One FIFO can then be provided for each of a plurality of virtual processors in a multithreading environment. In addition, memory locations can be provided for use as tag memory, as is customary with caches.

It should also be mentioned that marking the content of IRAMs as "dirty" in the cache sense is helpful, so that the content can be written back to an external memory as soon as possible if it is not to be used again in the same IRAM. The XPP field and the cache controller can thus be regarded as a single unit, since they do not require different instruction streams. Rather, the cache controller can be seen as the implementation of the stages "configuration fetch", "operand fetch" (IRAM preload) and "write back", i.e. CF, OF and WB, in the XPP pipeline, with the execution stage (EX) also being triggered. Because of the long latencies and the unpredictability, for example due to cache misses or configurations of different lengths, it is advantageous if the stages are overlapped over several configurations, the configuration and data preload FIFO (pipeline) being used for loose coupling. It should be mentioned that the FILMO, known per se, can be arranged downstream of the preload. It should also be mentioned that preloading can be speculative, with the degree of speculation being determined by the compiler. No disadvantage arises from incorrect preloading, however, since configurations that have only been preloaded but not executed can easily be released for overwriting, as can the associated data. The FIFO can be preloaded several configurations ahead, and this may depend on the properties of the algorithm. It is possible to use hardware for this.
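
The benefit of overlapping the CF, OF, EX and WB stages over several configurations can be estimated with a small timing model. This is a rough sketch, not a description of the actual hardware: the function name, the latency figures and the assumption of unbounded FIFOs between the stages are all hypothetical.

```python
def overlapped_finish(latencies):
    """latencies[i] = (cf, of, ex, wb) cycle counts for configuration i.

    Each stage handles one configuration at a time, stages run concurrently,
    and the FIFOs between stages are assumed never to fill up.
    Returns the cycle in which the last write-back completes.
    """
    stage_free = [0, 0, 0, 0]  # cycle at which each stage becomes free again
    for lat in latencies:
        ready = 0
        for s, cycles in enumerate(lat):
            start = max(ready, stage_free[s])  # wait for input and for the stage
            ready = start + cycles
            stage_free[s] = ready
    return stage_free[-1]


# Illustrative latencies only; the second configuration suffers a long operand fetch.
configs = [(4, 10, 6, 3), (2, 20, 5, 3), (4, 8, 6, 3)]
serial = sum(sum(c) for c in configs)   # no overlap at all
overlapped = overlapped_finish(configs)
```

With these made-up numbers the overlapped schedule hides part of the long operand fetch behind the neighbouring configurations, which is the effect the loose FIFO coupling is meant to achieve.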

As for writing used data back from the IRAM to external memory, this can be performed by a suitable cache controller assigned to the XPP; it should be pointed out that this controller will typically prioritize its tasks and will preferably execute preload operations, which have a high priority on account of the assigned execution status. On the other hand, preloading can also be blocked by a higher-level IRAM instance in another block or by the lack of empty IRAM instances in the target IRAM block. In the latter case, the configuration can wait until a configuration and/or a write-back has ended. The IRAM instance in a different block can be in use or "dirty". It can be provided that the clean IRAMs used least recently are discarded, that is to say are regarded as "empty". If there are neither empty nor clean IRAM instances, a "dirty" IRAM part or a non-empty part must be written back to the memory hierarchy. Since only one instance can ever be in use at a time, and since there should be more than one instance in an IRAM block so that a cache effect is achieved, it cannot happen that neither empty nor clean nor "dirty" IRAM instances exist.
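
The order of preference described above (empty first, then a clean instance that is discarded, and only then a dirty instance that must be written back) can be sketched as follows. The class `IramInstance`, its fields and the `write_back` callback are hypothetical names chosen for illustration, and choosing the least recently used candidate is an assumption.

```python
from dataclasses import dataclass


@dataclass
class IramInstance:
    tag: object = None     # start address of the cached memory area, None if empty
    dirty: bool = False
    in_use: bool = False
    last_used: int = 0     # larger value means used more recently


def pick_instance(block, write_back):
    """Choose an IRAM instance of `block` for a new preload, or None if all are in use."""
    free = [inst for inst in block if not inst.in_use]
    if not free:
        return None                      # the caller has to wait
    empty = [inst for inst in free if inst.tag is None]
    if empty:
        return empty[0]
    clean = [inst for inst in free if not inst.dirty]
    if clean:
        victim = min(clean, key=lambda inst: inst.last_used)   # discard a clean copy
    else:
        victim = min(free, key=lambda inst: inst.last_used)
        write_back(victim)               # dirty data must reach the memory hierarchy
        victim.dirty = False
    victim.tag = None                    # the instance now counts as empty
    return victim


block = [IramInstance(tag=0x100, in_use=True),
         IramInstance(tag=0x200, dirty=True, last_used=5)]
written = []
chosen = pick_instance(block, written.append)
```

In this example the only free instance is dirty, so it is written back once and then handed out as empty.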

Examples of architectures in which an SMT processor is coupled to an XPP thread resource can be found, for example, in FIGS. 4a-c.

In the variant presented and preferred here, too, it may be necessary to limit the memory traffic during context switches, which is possible in various ways. For example, read-only data need not be saved, as is the case with configurations. For non-interruptible (non-preemptive) configurations, the local states of buses and PAEs need not be saved.

It can be provided that only modified data is stored and cache strategies can be used to reduce storage traffic. For this purpose, an LRU strategy (LRU = least recently used) can be implemented, in particular in addition to a precharging mechanism, especially in the case of frequent context changes.

If IRAMs are defined as local cache copies of the main memory and each IRAM is assigned a start address and modification status information, it is preferred that the IRAM cells are also replicated, as for SMT support, so that only the start addresses of the IRAMs have to be saved and reloaded as context. The start addresses for the IRAMs of a current configuration then select the IRAM instances with identical addresses for use. If no address tag of an IRAM instance corresponds to the address of the newly loaded or reloaded context, the corresponding memory area can be loaded into an empty IRAM instance, which is to be understood here as a free IRAM area. If none is available, the procedures described above can be used.
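
The selection of IRAM instances by start address on a context switch can be sketched as a simple tag lookup. The function name `bind_irams` and the representation of the instances as a list of tags are hypothetical and serve only to illustrate the matching rule.

```python
def bind_irams(context_starts, instance_tags):
    """Map each start address of a context to the index of an IRAM instance.

    instance_tags[i] is the start address held by instance i, or None if it is free.
    Returns (binding, to_load): the address-to-instance map and the list of
    memory areas that actually have to be loaded.
    """
    binding, to_load = {}, []
    for addr in context_starts:
        if addr in instance_tags:
            binding[addr] = instance_tags.index(addr)   # tag hit: nothing to reload
        elif None in instance_tags:
            idx = instance_tags.index(None)             # use a free instance
            instance_tags[idx] = addr
            binding[addr] = idx
            to_load.append(addr)
        else:
            raise RuntimeError("no free IRAM instance; one must be evicted first")
    return binding, to_load


tags = [0x1000, None, 0x3000]
binding, to_load = bind_irams([0x3000, 0x2000], tags)
```

Only the start addresses travel with the context; data that are still resident under a matching tag are reused without any memory traffic.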

It should also be noted that delays due to write-back can be avoided by using a separate state machine (cache controller), in particular one that attempts to write back currently inactive IRAM instances during unused memory cycles.

It should be noted that, as can be seen from the above, the cache is preferably to be understood as an explicit cache and not, as usual, as a cache that is transparent to the programmer and/or compiler. In order to control it, the following instructions can be issued, for example by the compiler: configuration preload instructions, which precede the IRAM preload instructions used by that configuration. Such configuration preload instructions should be issued by the scheduler as early as possible. Furthermore, that is, alternatively and/or additionally, IRAM preload instructions can be provided, which should likewise be issued by the scheduler at an early stage, and configuration execution instructions can be provided, which follow the IRAM preload instructions for this configuration; these configuration execution instructions can in particular be delayed relative to the preload instructions by the estimated latencies.
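
The ordering of the three instruction kinds for one configuration can be written down as a short sketch. The mnemonics `CONFIG_PRELOAD`, `IRAM_PRELOAD` and `CONFIG_EXECUTE`, the function name and the tuple layout are hypothetical; the text defines only the order, not an encoding.

```python
def emit_schedule(config_id, iram_areas):
    """Return the instruction order for one configuration.

    iram_areas: list of (start_address, size) pairs, one per IRAM to preload.
    """
    seq = [("CONFIG_PRELOAD", config_id)]            # issued as early as possible
    for iram, (start, size) in enumerate(iram_areas):
        seq.append(("IRAM_PRELOAD", iram, start, size))
    seq.append(("CONFIG_EXECUTE", config_id))        # follows all of its IRAM preloads
    return seq


schedule = emit_schedule(7, [(0x1000, 128), (0x2000, 128)])
```

A real scheduler would additionally move unrelated instructions between the preloads and the execute instruction in order to cover the estimated preload latency.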

It can also be provided that a configuration wait instruction is executed, followed by an instruction that forces a cache write-back, both of which are issued by the compiler, in particular when an instruction of another functional unit, such as the load/store unit, can access a memory area that is potentially "dirty" or in use in an IRAM. This can be used to force synchronization of the instruction streams and the cache contents while avoiding data hazards. Such synchronization instructions need not necessarily occur frequently.
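
When a compiler would have to place the wait and write-back pair can be sketched as an overlap test between an access of the load/store unit and the memory areas held in IRAMs. The mnemonics `CONFIG_WAIT` and `CACHE_WRITE_BACK` and both function names are hypothetical.

```python
def overlaps(a, b):
    """True if the (start, size) ranges a and b share at least one address."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def guard_access(access, iram_areas):
    """Return the synchronization instructions to place before `access`.

    access: (start, size) range touched by the load/store unit.
    iram_areas: (start, size) ranges that may be dirty or in use in an IRAM.
    """
    if any(overlaps(access, area) for area in iram_areas):
        return [("CONFIG_WAIT",), ("CACHE_WRITE_BACK", access)]
    return []   # no hazard, no synchronization needed


sync_needed = guard_access((0x120, 4), [(0x100, 0x40)])
sync_free = guard_access((0x200, 4), [(0x100, 0x40)])
```

Because the pair is emitted only for overlapping ranges, accesses to unrelated memory stay free of synchronization cost.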

It should be mentioned that loading and/or storing data does not necessarily have to be carried out by a procedure based entirely on the logic cell field. Rather, it is also possible to provide, for example, one or more separate and/or dedicated DMA units, that is to say in particular DMA controllers, which can, for example, be configured, i.e. functionally prepared and/or set up, by specifying start address, step size, block size, destination addresses, etc., in particular by the CT and/or from the logic cell field.

Loading can in particular also take place from a cache, and storing into it. This has the advantages that external communication with larger memory banks is handled via the cache controller, without separate circuit arrangements having to be provided for this within the data processing logic cell field; that read or write access to cache memory is typically very fast and has at most a low latency; and that a CPU unit is typically also connected to this cache, typically via a separate LOAD/STORE unit there, so that access to data and the exchange of data between the CPU core and the data processing logic cell field can be carried out block by block, quickly, and in such a way that a separate command does not have to be fetched, for example by the opcode fetcher of the CPU, and processed for each data transfer.

This cache coupling also proves to be considerably more favorable than coupling a data processing logic cell field to the ALU via registers when these registers communicate with a cache only via a LOAD/STORE unit, as is known per se from the cited non-PACT documents.

A further data connection to the load/store unit of the sequential CPU unit, or of one of the sequential CPU units, assigned to the data processing logic cell field and/or to its registers can be provided. It should be mentioned that such units can be addressed via separate input/output ports (IO ports) of the data processing logic cell arrangement, which can be configured in particular as a VPU or XPP, and/or by means of one or more multiplexers connected downstream of a single port.

It should also be mentioned that, in addition to the writing and/or reading access to cache areas, which takes place in particular block-wise and/or in streaming fashion and/or as random access, in particular in RMW (read-modify-write) mode, and/or in addition to the LOAD/STORE unit and/or the connection (known per se in the prior art) to the registers of the sequential CPU, there can also be a connection to an external mass storage device such as a RAM, a hard disk and/or another data exchange port such as an antenna and so on. A separate port can be provided for this access to storage means other than the cache and/or LOAD/STORE unit and/or register unit. Suitable drivers, buffers, signal conditioners for level adjustment and so on can be provided here, e.g. LS 74244 or LS 74245. Incidentally, it should be mentioned that, in particular but not exclusively for processing a data stream flowing into or within the data processing logic cell field, the logic cells of the field can comprise ALUs or EALUs, and that short, fine-granularly configurable FPGA-like circuits can typically be placed upstream of them on the input side and/or downstream of them on the output side, in particular on both sides, and/or integrated into the PAE-ALU, for example in order to cut bit blocks out of a continuous data stream, as is required for MPEG-4 decoding. On the one hand, this is advantageous if a data stream is to enter the cell and be subjected to a kind of preprocessing there without blocking larger PAE units. It is also of particular advantage if the ALU is designed as a SIMD arithmetic unit, in which case a very wide data input word of, for example, 32 bits is split by the upstream FPGA-like strips into several parallel data words of, for example, 4 bits each, which can then be processed in parallel in the SIMD arithmetic units. This can significantly increase the overall performance of the system, provided the application calls for it.
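
Splitting a wide input word into narrow parallel lanes for a SIMD unit can be illustrated in a few lines. The function name `split_word` and the choice to return the least significant lane first are assumptions made for this sketch.

```python
def split_word(word, total_bits=32, lane_bits=4):
    """Split `word` into total_bits // lane_bits lanes, least significant lane first."""
    mask = (1 << lane_bits) - 1
    return [(word >> shift) & mask for shift in range(0, total_bits, lane_bits)]


lanes = split_word(0x89ABCDEF)
# The same function models coarser or finer strips of 8, 4 or 2 bits.
lane_counts = [len(split_word(0x89ABCDEF, 32, bits)) for bits in (8, 4, 2)]
```

Each resulting lane could then be fed to one slice of a SIMD arithmetic unit.
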
It should be pointed out that FPGA-like upstream or downstream structures were mentioned above. It should be explicitly mentioned, however, that FPGA-like does not necessarily refer to 1-bit granular arrangements. In particular, instead of these hyper-fine-granular structures, it is possible to provide merely fine-granular structures of, for example, 4-bit width. This means that the FPGA-like input and/or output structures before and/or after an ALU unit, in particular one designed as a SIMD arithmetic unit, can be configured, for example, such that 4-bit-wide data words are always supplied and/or processed. It is possible to provide cascading here, so that, for example, the incoming 32-bit-wide data words flow into four separate or separating 8-bit FPGA-like structures arranged side by side, these four 8-bit-wide FPGA-like structures are followed by a second stripe with eight 4-bit-wide FPGA-like structures, and, if this is considered necessary for the respective purpose, a further such stripe follows in which, for example, sixteen 2-bit-wide FPGA-like structures are arranged in parallel next to one another. In this case, a considerable reduction in configuration effort can be achieved compared with purely hyper-fine-granular FPGA-like structures. It should be mentioned that this also allows the configuration memory and so on of the FPGA-like structure to be significantly smaller, thus saving chip area. It should also be mentioned that FPGA-like stripe structures, as also disclosed in connection with FIG. 3, make it particularly easy to implement pseudo-random noise generators, in particular when arranged in the PAE. If the individual output bits obtained step by step from a single FPGA cell are stored back into that FPGA cell, pseudo-random noise can also be generated sequentially with a single cell, which is considered inventive per se, cf. Fig. 5.
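
One common way to obtain pseudo-random noise by feeding output bits back into a small cell is a linear feedback shift register (LFSR). The text does not name a specific generator, so the 4-bit LFSR below, its feedback taps and the function name `lfsr4_step` are assumptions chosen purely for illustration.

```python
def lfsr4_step(state):
    """Advance a 4-bit Fibonacci LFSR by one step and return the new state.

    The feedback bit is the XOR of the two lowest bits and is shifted in at the top.
    """
    bit = (state ^ (state >> 1)) & 1
    return (state >> 1) | (bit << 3)


seed = 0b0001
states = []
state = seed
for _ in range(15):
    states.append(state)
    state = lfsr4_step(state)   # the output is stored back as the next input

noise_bits = [s & 1 for s in states]   # one pseudo-random bit per step
```

With this feedback the register runs through all 15 non-zero states before it returns to the seed, so a single small cell yields a bit sequence with a period of 15.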

In principle, the coupling advantages described above for data block streams can be achieved via the cache. However, it is particularly preferred if the cache is built up in strips (slice-like) and several of the slices, in particular all of them, can then be accessed simultaneously. This is advantageous if, as will be discussed later, a large number of threads have to be processed on the data processing logic cell array (XPP) and/or the sequential CPU or CPUs, be it by means of hyperthreading, multitasking and/or multithreading. Cache memory means with slice access, or with control means enabling slice access, are therefore preferably provided. For example, each thread can be assigned its own slice. This makes it possible to ensure later, when the threads are processed, that the relevant cache areas are accessed when the command group to be processed with the thread is resumed.

It should be mentioned again that the cache does not necessarily have to be divided into slices and that, if it is, each slice does not necessarily have to be assigned to a separate thread. However, this is by far the preferred method. It should also be pointed out that there may be cases in which not all cache areas are used simultaneously or at a given time. Rather, it is to be expected that in typical data processing applications, such as occur in hand-held mobile telephones (cell phones), laptops, cameras and so on, there will often be times when the entire cache is not required. It is therefore particularly preferred if individual cache areas can be disconnected from the power supply in such a way that their energy consumption drops significantly, in particular to or near zero. With a slice-wise design of the cache, this can be done by deactivating the cache slice by slice using suitable power disconnection means, cf. for example Fig. 2. The disconnection can be effected by down-clocking, by disconnecting the clock or by disconnecting the power. In particular, an access detection means can be assigned to an individual cache slice or the like, designed to detect whether a respective cache area or cache slice currently has a thread, hyperthread or task assigned to it by which it is being used. If the access detection means determines that this is not the case, the slice is typically disconnected from the clock and/or even from the power. It should be mentioned that when the power is switched on again after a disconnection, the cache area can be used again immediately, i.e. no significant delay is to be expected from switching the power supply on and off, provided this is implemented in hardware using common, suitable semiconductor technologies. This is useful in many applications regardless of the use with logic cell fields.
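
The combination of per-thread slices and slice-wise deactivation can be modelled with a small bookkeeping class. The class name `SlicedCache`, its methods and the Boolean `powered` flags are hypothetical; real hardware would gate the clock or the supply rather than set a flag.

```python
class SlicedCache:
    """Track which thread owns which cache slice and which slices are powered."""

    def __init__(self, n_slices):
        self.owner = [None] * n_slices     # thread assigned to each slice, if any
        self.powered = [False] * n_slices  # unowned slices stay switched off

    def assign(self, thread):
        """Give `thread` a slice of its own and switch that slice on."""
        for i, owner in enumerate(self.owner):
            if owner is None:
                self.owner[i] = thread
                self.powered[i] = True
                return i
        raise RuntimeError("no free cache slice")

    def release(self, thread):
        """Model of access detection: a slice without an owner is switched off."""
        for i, owner in enumerate(self.owner):
            if owner == thread:
                self.owner[i] = None
                self.powered[i] = False


cache = SlicedCache(4)
slice_t0 = cache.assign("t0")
slice_t1 = cache.assign("t1")
cache.release("t0")
powered_now = list(cache.powered)
```

After thread `t0` has finished, only the slice of `t1` remains powered; a thread started later would reuse the first free slice.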

Another particular advantage of the present invention is that, although there is a particularly efficient coupling with regard to the transfer of data or operands, in particular in block form, balancing is nevertheless not necessary in the sense that exactly the same processing time would be required in the sequential CPU and in the XPP or data processing logic cell field. Rather, processing takes place in a manner that is in practice often independent, in particular such that the sequential CPU and the data processing logic cell array arrangement can be regarded as separate resources by a scheduler or the like. This allows known program splitting technologies such as multitasking, multithreading and hyperthreading to be implemented directly. The resulting advantage that path balancing is not required, i.e. balancing between sequential parts (e.g. on a RISC unit) and data flow parts (e.g. on an XPP), means that, for example, any number of pipeline stages can be run through within the sequential CPU (e.g. the RISC functional units), different clocking is possible, and so on. Another advantage of the present invention is that, by configuring a load configuration or a store configuration into the XPP or other data processing logic cell fields, data can be loaded into the field or written out of it at a speed that is no longer determined by the clock speed of the CPU, the speed at which the opcode fetcher works, or the like. In other words, the sequence control of the sequential CPU no longer acts as a bottleneck limiting the data throughput of the data processing logic cell field, without the coupling being merely loose.

While in a particularly preferred variant of the invention it is possible to use the CT (or CM; configuration manager or configuration table) known for an XPP unit both to configure one or more XPP fields, which may also be arranged hierarchically with several CTs, and at the same time to manage one or more sequential CPUs, i.e. to use it there as a kind of multithreading scheduler and hardware management, which has the inherent advantage that known technologies such as FILMO etc. can be used for hardware-assisted administration in multithreading, it is alternatively and/or, in particular in a hierarchical arrangement, additionally possible for a data processing logic cell field such as an XPP to receive configurations from the OpCode fetcher of a sequential CPU via the coprocessor interface. This means that the sequential CPU and/or another XPP can instantiate a call which leads to data processing on the XPP. The XPP is then kept in data exchange, e.g. via the cache coupling described and/or by means of LOAD and/or STORE configurations, which provide address generators for loading data into and/or writing data out of the XPP or data processing logic cell field. In other words, a coprocessor-like and/or thread-resource-like coupling of a data processing logic cell field is possible, while at the same time data is loaded in the manner of a data stream by means of cache and/or I/O port coupling.

It should be noted that the coprocessor coupling, i.e. the coupling of the data processing logic cell field, will typically result in the scheduling for this logic cell field also taking place on the sequential CPU or on a higher-level scheduler unit or corresponding scheduler means. In such a case, the threading control and management practically takes place on the scheduler or the sequential CPU. Although this is possible per se, it will not necessarily be the case, at least in the simplest implementation of the invention. Rather, the data processing logic cell array can be used by calling it in the conventional manner, as with a standard coprocessor, for example as in 8086/8087 combinations.

It should also be mentioned that in a particularly preferred variant, regardless of the type of configuration, be it via the coprocessor interface, via the configuration manager (CT) of the XPP or data processing logic cell field or the like, or in some other way, it is possible to address memories in or directly at the data processing logic cell field or under its administration, in particular internal memories, in particular, in the XPP architecture, RAM-PAEs as known from the applicant's various earlier applications and publications, or other appropriately managed or internal memories, like a vector register. That is, the amounts of data loaded via the LOAD configuration are stored vector-like in the internal memories as in vector registers. Then, after reconfiguring the XPP or data processing logic cell field, i.e. after overwriting or reloading and/or activating a new configuration which carries out the actual processing of the data (in this connection it should be pointed out that such a processing configuration may also refer to a plurality of configurations which are to be processed, e.g., in wave mode and/or sequentially one after the other), these memories are accessed as with a vector register, and the results and/or intermediate results obtained are in turn stored in the internal memories or in external memories managed via the XPP like internal memories.
The memory means thus filled with processing results in the manner of vector registers under XPP access are then, after the processing configuration has been reconfigured by loading the STORE configuration, written out in a suitable manner, which in turn happens in the manner of a data stream, be it via the I/O port directly into external memory areas and/or, as is particularly preferred, into cache memory areas, which can then be accessed at a later point in time by the sequential CPU and/or by other configurations on the XPP which previously generated the data or on another corresponding data processing unit.
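
The sequence of LOAD configuration, processing configuration and STORE configuration described above can be sketched as follows. This is an illustrative software model only; the class CellField, its method names and the register names "v0" and "v1" are hypothetical, and a Python list stands in for the cache area coupled to the field.

```python
from typing import Callable, Dict, List


class CellField:
    """Toy model: internal memories of the field addressed like vector registers."""

    def __init__(self) -> None:
        self.internal: Dict[str, List[int]] = {}

    def load_configuration(self, cache: List[int], start: int, length: int, reg: str) -> None:
        # LOAD configuration: an address generator streams a block into an internal memory.
        self.internal[reg] = [cache[addr] for addr in range(start, start + length)]

    def processing_configuration(self, op: Callable[[int], int], src: str, dst: str) -> None:
        # Processing configuration: reads one "vector register", writes results to another.
        self.internal[dst] = [op(value) for value in self.internal[src]]

    def store_configuration(self, cache: List[int], start: int, reg: str) -> None:
        # STORE configuration: streams the results back into a cache area.
        for offset, value in enumerate(self.internal[reg]):
            cache[start + offset] = value


cache = list(range(8))
field = CellField()
field.load_configuration(cache, start=0, length=4, reg="v0")
field.processing_configuration(lambda x: x * x, src="v0", dst="v1")
field.store_configuration(cache, start=4, reg="v1")
# cache is now [0, 1, 2, 3, 0, 1, 4, 9]
```

The three method calls correspond to three successive configurations of the same field; between them the data remains in the internal memories, so that no transfer under control of the sequential CPU is modeled.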

A particularly preferred variant consists, at least for certain data processing results and/or intermediate results, in not using, as the storage or vector register means in which the data obtained is to be stored, an internal memory from which data is to be written out via a STORE configuration into the cache or another area accessible to the sequential CPU or another data processing unit, but instead writing the results directly into corresponding, in particular access-reserved, cache areas, which can in particular be organized like slices. This may have the disadvantage of greater latency, especially if the paths between the XPP or data processing logic cell array unit and the cache are so long that the signal propagation times are significant, but may mean that no further STORE configuration is required. It should also be mentioned that such storage of data in cache areas is possible, on the one hand, as described above, in that the memory being written to is located physically close to the cache controller and is designed as a cache; alternatively and/or additionally, however, it is also possible to place part of an XPP memory area, XPP-internal memory or the like, in particular in the case of RAM via PAEs, cf. PACT31 (DE 102 12 621.6, WO 03/036507), under the management of one cache memory controller or of several in succession. This has advantages if the latency when storing the processing results, which are determined within the data processing logic cell field, is to be kept low, while the latency when other units access the memory area, which then serves only as a “quasi-cache”, is not significant or does not matter.

It should also be mentioned that a configuration is also possible in which the cache controller of a conventional sequential CPU addresses as cache a memory area which, without serving the data exchange with the data processing logic cell field, is physically located on and/or at it. This has the advantage that if applications running on the data processing logic cell field have at most a small local memory requirement, and/or if only a few further configurations are required in relation to the available memory quantities, these memories can be made available to one or more sequential CPUs as cache. It should be mentioned that the cache controller can and will then be designed for the management of a cache area of dynamic extent, i.e. of varying size. Dynamic cache size management, or cache size management means for dynamic cache management, will typically take into account the workload and/or the input/output load on the sequential CPU and/or on the data processing logic cell field. In other words, it can be analyzed, for example, how many NOPs occur during data accesses on the sequential CPU in a given time unit and/or how many configurations should be held in the memory areas provided for this purpose in the XPP field in order to enable quick reconfiguration, be it by way of wave reconfiguration or in another way. The dynamic cache size disclosed here is particularly preferably runtime-dynamic, i.e. the cache controller manages a current cache size which can change from cycle to cycle or from cycle group to cycle group. It should also be pointed out that the access management of an XPP or data processing logic cell field, with access as internal memory as with a vector register and as cache-like memory for external access, has already been described, as far as memory access is concerned, in DE 196 54 595 and PCT/DE 97/03013 (PACT03). The cited documents are hereby incorporated in full by reference for the purposes of disclosure.
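
A runtime-dynamic cache size of the kind described above can be sketched as follows. The sizing rule used here is purely an assumption for illustration: the text only states that workload and stored configurations are taken into account, not how. The function names, the block granularity and the parameter blocks_per_config are hypothetical.

```python
from typing import List, Tuple


def cache_blocks_available(total_blocks: int, local_blocks: int,
                           stored_configs: int, blocks_per_config: int = 1) -> int:
    """Blocks of field memory that can be lent to the sequential CPU as cache.

    Assumed rule: whatever is not needed as local memory by the running
    application or for configurations held ready for quick reconfiguration.
    """
    reserved = local_blocks + stored_configs * blocks_per_config
    return max(0, total_blocks - reserved)


def cache_size_trace(total_blocks: int, demand: List[Tuple[int, int]]) -> List[int]:
    """Recompute the cache size once per cycle group.

    demand holds one (local_blocks, stored_configs) pair per cycle group.
    """
    return [cache_blocks_available(total_blocks, local, configs)
            for local, configs in demand]
```

With 16 blocks in total, the demand sequence (2, 4), (10, 8), (0, 0) yields cache sizes of 10, 0 and 16 blocks, i.e. the cache controller would see a different current size in each cycle group.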

Above, reference was made to data processing logic cell fields which can be reconfigured in particular at runtime. It was discussed that a configuration management unit (CT or CM) can be provided for these. The administration of configurations is known per se from the applicant's various property rights, to which reference is made for disclosure purposes, and from the applicant's other publications. It should now be explicitly pointed out that such units and their mode of operation, with which configurations that are not yet required can be preloaded, in particular independently of connections to sequential CPUs etc., can also be used very well to effect a task, thread and/or hyperthread change in multitasking operation and/or in hyperthreading and/or multithreading, cf. for example Figs. 6a to 6c. To this end, use can be made of the fact that, during the runtime of a thread or task, configurations for different tasks, threads or hyperthreads can be loaded into the configuration memory of an individual cell or a group of cells of the data processing logic cell field, for example a PAE of a PAE field (PA). This means that if a task or thread is blocked, for example because data has to be waited for that is not yet available, be it because it has not yet been generated or received by another unit, for example due to latencies, be it because a resource is currently still blocked by another access, then configurations for another task or thread can be preloaded and/or are preloaded, and a switch can be made to them without having to wait for the time overhead of a configuration change, in particular where the configuration has been shadow-loaded.

While it is possible in principle to use this technique even if the most likely continuation within a task is predicted and a prediction does not apply (prediction miss), this type of operation will be preferred for predictive operation. When used with a purely sequential CPU and/or a plurality of purely sequential CPUs, in particular exclusively with such CPUs, multithreading management hardware is thus realized by connecting a configuration manager. In this regard, reference is made in particular to PACT10 (DE 198 07 872.2, WO 99/44147, WO 99/44120) and PACT17 (DE 100 28 397.7, WO 02/13000). It can be considered sufficient, especially if hyperthreading management is desired only for one CPU and/or a few sequential CPUs, to dispense with certain subcircuits such as FILMO which are described in the property rights specifically referred to. In particular, the use of the configuration manager described there, with and/or without FILMO, for hyperthreading management for one and/or several purely sequential CPUs, with or without coupling to an XPP or another data processing logic cell field, is hereby disclosed and claimed. This is regarded as an inventive feature in itself. It should also be mentioned that a large number of CPUs can be implemented using the known techniques, such as those known from PACT31 (DE 102 12 621.6-53, PCT/EP 02/10572) and PACT34 (DE 102 41 812.8, PCT/EP 03/09957), in which one or more sequential CPUs are constructed within an array, using one or more memory areas, in particular in the data processing logic cell field, for the construction of the sequential CPU, in particular as command and/or data registers.
It should also be noted that previous applications such as PACT02 (DE 196 51 075.9-53, WO 98/26356), PACT04 (DE 196 54 846.2-53, WO 98/29952) and PACT08 (DE 197 04 728.9, WO 98/35299) have disclosed how sequencers with ring and/or random access memories can be constructed.

It should be noted that a task, thread and/or hyperthread change using the known CT technology, cf. PACT10 (DE 198 07 872.2, WO 99/44147, WO 99/44120) and PACT17 (DE 100 28 397.7, WO 02/13000), can be carried out, and preferably will be carried out, in such a way that performance slices and/or time slices are assigned by the CT to a software-implemented operating system scheduler or the like, during which it is determined which parts of which tasks or threads are to be processed next, assuming that resources are free. An example is given as follows: First, an address sequence is to be generated for a first task, according to which, during the execution of a LOAD configuration, data is to be loaded from a memory and/or cache memory to which a data processing logic cell array is coupled in the manner described. As soon as this data is available, processing of a second configuration, the actual data processing configuration, can begin. This can also be preloaded, since it is certain that this configuration must be executed unless interrupts or the like force a complete task change. In conventional processors, there is the known problem of the so-called cache miss, in which data is requested but is not available in the cache for load access. If such a case occurs in a coupling according to the present invention, it is preferably possible to switch to another thread, hyperthread and/or task which, in particular, was previously determined for the next possible execution by the software-implemented operating system scheduler and/or another hardware- and/or software-implemented unit acting correspondingly, and which was accordingly loaded, preferably in advance, into one of the available configuration memories of the data processing logic cell field, in particular in the background during the execution of another configuration, for example the LOAD configuration which effects the loading of the data now being waited for.
It is again explicitly mentioned here that, for pre-configuration undisturbed by the actual interconnection of the data processing logic cells of the data processing logic cell field, which are in particular coarse-grained, separate configuration lines can be routed from the configuration unit to the respective cells directly and/or via suitable bus systems, as is known per se in the prior art, since this arrangement is particularly preferred here in order to enable pre-configuration without disturbing another, currently running configuration. Reference is made to PACT10 (DE 198 07 872.2, WO 99/44147, WO 99/44120), PACT17 (DE 100 28 397.7, WO 02/13000), PACT13 (DE 199 26 538.0, WO 00/77652), PACT02 (DE 196 51 075.9, WO 98/26356) and PACT08 (DE 197 04 728.9, WO 98/35299). When the configuration to which the change was made during or on the basis of the task, thread and/or hyperthread change has been executed, and, in the case of the preferred non-divisible, uninterruptible and thus quasi-atomic configurations, has been processed to the end, cf. PACT19 (DE 102 02 044.2, WO 2003/060747) and PACT11 (DE 101 39 170.6, WO 03/017095), then, in part, a further configuration as predetermined by the corresponding scheduler, in particular the scheduler close to the operating system, is processed, and/or the configuration for which the associated LOAD configuration was previously carried out. Before executing a processing configuration for which a LOAD configuration was previously carried out, it can in particular be tested, e.g. by querying the status of the LOAD configuration or of the data-loading DMA controller, whether the corresponding data has by now flowed into the array, i.e. whether the latency that typically occurs has passed and/or the data is actually available.

In other words, latencies that occur because, e.g., configurations have not yet been configured in, data has not yet been loaded and/or data has not yet been written out are bridged and/or hidden by executing threads, hyperthreads and/or tasks which are already preconfigured and which work with data that is already available or that can be written out to resources which are already available for this purpose. In this way, latency times are largely hidden and, assuming a sufficient number of threads, hyperthreads and/or tasks to be executed per se, practically 100% utilization of the data processing logic cell field is achieved.
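
The hiding of latencies by switching to preconfigured threads can be illustrated by a small cycle-count simulation. It is a sketch under simplifying assumptions that are not taken from the text: all LOAD configurations are issued at cycle 0 and run in the background, each processing configuration is preloaded, atomic and runs to completion, and the function name and the tuple layout are hypothetical.

```python
from typing import List, Tuple

Task = Tuple[str, int, int]  # (name, load latency in cycles, processing cycles)


def run(tasks: List[Task]) -> Tuple[List[str], int, int]:
    """Return (execution order, busy cycles, total cycles) for the cell field."""
    pending = list(tasks)
    clock = 0
    busy = 0
    order: List[str] = []
    while pending:
        # A task is ready once the data of its LOAD configuration has arrived.
        ready = [task for task in pending if task[1] <= clock]
        if not ready:
            clock += 1  # no preconfigured task has data available: the field idles
            continue
        name, _, work = ready[0]
        pending.remove(ready[0])
        order.append(name)
        clock += work  # quasi-atomic configuration, processed to the end
        busy += work
    return order, busy, clock


order, busy, total = run([("A", 4, 3), ("B", 0, 4), ("C", 2, 2)])
# order == ["B", "A", "C"], busy == total == 9: the load latency of A and C is hidden
```

If task A is run alone, run([("A", 4, 3)]) gives 3 busy cycles out of 7, since nothing else is available to bridge the four cycles of load latency.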

It should be mentioned that by providing a sufficient number of XPP-internal memory resources which are freely assigned to threads, e.g. by the scheduler or the CT, the cache operations and/or write operations of several threads can be carried out simultaneously and/or in an overlapping manner, which has a particularly positive effect on bridging any latencies.

With the system described, with regard to data stream capability with simultaneous coupling to a sequential CPU and/or with regard to the coupling of an XPP array or data processing logic cell array, and at the same time of a sequential CPU, to a suitable scheduler unit such as a configuration manager or the like, real-time-capable systems in particular can be readily implemented. To ensure real-time capability, it must be guaranteed that incoming data or interrupts, which in particular signal the arrival of data, can be responded to within a maximum time that is never exceeded. This can be done, for example, by a task change in response to an interrupt and/or, for example in the case of prioritized interrupts, by determining that a given interrupt is to be ignored for the moment, which must likewise be determined within a certain time. A task change in such real-time-capable systems will typically be possible in three cases, namely when a task has run for a certain time (timer principle), when a resource is not available, be it because it is blocked by another access or because of latencies when accessing it, in particular in a writing and/or reading manner, i.e. in the event of latencies in data access, and/or when interrupts occur. It is also pointed out that, in particular, a runtime-limited configuration on a resource that is to be released or changed for interrupt processing can retrigger a watchdog or tracking counter.

While it was otherwise explicitly stated, cf. PACT29 (DE 102 12 622.4, WO 03/081454), that retriggering of the tracking counter or watchdog to increase the runtime can be prevented by a task switch, it is explicitly disclosed in the present case that an interrupt can also, i.e. in the same way as a task switch, block the retriggering of the tracking counter or watchdog, i.e. in such a case the configuration can be prevented from increasing its own maximum possible runtime by retriggering.
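
The blocking of retriggering by an interrupt can be modeled in a few lines. The following is a hedged sketch only: the class name TrackingCounter, its methods and the cycle-based counting are hypothetical and merely illustrate the behavior described, namely that after an interrupt a configuration can no longer extend its own runtime.

```python
class TrackingCounter:
    """Hypothetical model of a watchdog/tracking counter for a runtime-limited configuration."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.remaining = limit
        self.retrigger_blocked = False

    def tick(self) -> bool:
        """Advance one cycle; return True once the permitted runtime has elapsed."""
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0

    def retrigger(self) -> bool:
        """Request by the running configuration to extend its runtime."""
        if self.retrigger_blocked:
            return False  # an interrupt (or task switch) is pending: no extension
        self.remaining = self.limit
        return True

    def interrupt(self) -> None:
        """An interrupt blocks all further retriggering."""
        self.retrigger_blocked = True
```

After interrupt() has been called, the configuration is guaranteed to reach its runtime limit after at most the cycles still remaining, which bounds the time until the resource can be released for interrupt processing.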

With the present invention, the real-time capability of a data processing logic cell array can now be achieved by implementing one or more of three possible variants.

A first variant consists in switching to the processing of, for example, an interrupt within a resource that can be addressed by the scheduler or the CT. If the response times to interrupts or other requests are so long that a configuration can still be processed without interruption during this time, this is uncritical, especially since a configuration for interrupt processing can be preloaded on the resource that has to be changed to process the interrupt while the currently running configuration is still being processed. The selection of the interrupt-processing configuration to be preloaded is made, e.g., by the CT. It is possible to limit the runtime of the configuration on the resource that is to be released or changed for interrupt processing. Reference is made to PACT29/PCT (PCT/DE03/000942).

In systems that have to react faster to interrupts, it can be preferred to reserve a single resource, for example a separate XPP unit and / or parts of an XPP field, for such processing. If an interrupt to be processed quickly occurs, either a configuration that has already been preloaded for particularly critical interrupts can be processed or the loading of an interrupt handling configuration into the reserved resource is started immediately. A selection of the configuration required for the corresponding interrupt is possible by means of appropriate triggering, wave processing, etc.

It should also be mentioned that, with the methods already described, it is easily possible to obtain an immediate response to an interrupt by achieving code re-entrance through the use of LOAD/STORE configurations. Here, after each data-processing configuration or at given intervals, for example every five or ten configurations, a STORE configuration is carried out and then a LOAD configuration is carried out which accesses those memory areas into which data was previously written out. If it is ensured that the memory areas used by the STORE configuration remain unaffected until, as the task progresses, a further configuration has written out all relevant information (states, data), then it is ensured that on reloading, i.e. on re-entry into a configuration or configuration chain that was previously started but not completed, the same conditions are obtained again. Such an interposition of LOAD/STORE configurations, with simultaneous protection of STORE memory areas that are not yet outdated, can be generated automatically very easily without additional programming effort, e.g. by a compiler. Resource reservation may be advantageous there where appropriate. It should be mentioned again that, with resource reservation and/or in other cases, it is possible to react to at least a number of high-priority interrupts by preloading certain configurations.
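
The re-entrance obtained by interposed STORE and LOAD configurations can be sketched as a checkpointing scheme. This is an illustrative model under assumptions: the function run_chain, the representation of a stored memory area as a (next index, state) pair and the integer state are hypothetical; the essential property shown is that the previously stored area stays untouched until a newer one is complete.

```python
from typing import Callable, List, Optional, Tuple

Checkpoint = Tuple[int, int]  # (index of the next configuration, stored state)


def run_chain(configs: List[Callable[[int], int]], checkpoint: Checkpoint,
              every: int, interrupt_at: Optional[int] = None) -> Tuple[bool, Checkpoint]:
    """Run a configuration chain from a checkpoint; return (finished, last valid checkpoint)."""
    index, state = checkpoint  # LOAD configuration: re-enter with the stored conditions
    while index < len(configs):
        if interrupt_at is not None and index == interrupt_at:
            # Interrupt: abandon the chain at once; unsaved progress is discarded,
            # the last completely stored area remains valid.
            return False, checkpoint
        state = configs[index](state)
        index += 1
        if index % every == 0 or index == len(configs):
            # STORE configuration: the old area is replaced only by a complete new one.
            checkpoint = (index, state)
    return True, checkpoint


chain = [lambda s: s + 1, lambda s: s * 2, lambda s: s + 3, lambda s: s * 4, lambda s: s - 5]
done, saved = run_chain(chain, (0, 1), every=2, interrupt_at=3)   # interrupted after 3 configurations
done, final = run_chain(chain, saved, every=2)                    # re-entry from the stored area
# final == (5, 23), identical to an uninterrupted run
```

In the example, the interrupted run returns the checkpoint stored after the second configuration, and re-entry from it reproduces the result of an uninterrupted run, at the cost of repeating the third configuration.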

A further, particularly preferred variant of the response to interrupts, where at least one of the addressable resources is a sequential CPU, consists in executing on it an interrupt routine in which code for the data processing logic cell field is in turn prohibited. In other words, a time-critical interrupt routine is processed exclusively on a sequential CPU, without XPP data processing steps being called. This guarantees that the processing operation on the data processing logic cell field is not interrupted, and further processing can then take place on this data processing logic cell field after a task switch. Although the actual interrupt routine thus contains no XPP code, it can nevertheless be ensured that, following an interrupt, the XPP can be used at a later point in time, which is no longer real-time relevant, to respond by means of the data processing logic cell field to a state and/or data detected by an interrupt and/or a real-time request.

Claims

1. Data processing device with a data processing logic cell field and at least one sequential CPU, characterized in that a coupling of the sequential CPU and the data processing logic cell field for data exchange is possible in particular in block form by lines leading to a cache memory.

2. Method for operating a reconfigurable unit with runtime-limited configurations, in which the configurations can increase their maximum permissible runtime, in particular by triggering a tracking counter, characterized in that a configuration runtime increase is prevented by the configuration in response to an interrupt.