Checkpointing a superscalar, out-of-order processor for error recovery

Bibliographic Details
Title: Checkpointing a superscalar, out-of-order processor for error recovery
Patent Number: 6,968,476
Publication Date: November 22, 2005
Appl. No: 10/180385
Application Filed: June 26, 2002
Abstract: The present invention relates to data processing systems with built-in error recovery from a given checkpoint. In order to checkpoint more than one instruction per cycle it is proposed to collect updates of a predetermined maximum number of register contents performed by a respective plurality of CISC/RISC instructions in a buffer (CSB) (60) for checkpoint states, whereby a checkpoint state comprises as many buffer slots as registers can be updated by said plurality of CISC instructions and an entry for a Program Counter value associated with the youngest external instruction of said plurality, and to update an Architected Register Array (ARA) (64) with freshly collected register data after determining that no error was detected in the register data after completion of said youngest external instruction of said plurality of external instructions. Handshake synchronization for consistent updates between storage in an L2-cache (66) via a Store Buffer (65) and an Architected Register Array (ARA) (64) is provided which is based on the youngest instruction ID (40) stored in the Checkpoint State Buffer (CSB) (60).
Inventors: Barowski, Harry Stefan (Boeblingen, DE); Schwermer, Hartmut (Stuttgart, DE); Tast, Hans-Werner (Weil im Schoenbuch, DE)
Assignees: International Business Machines Corporation (Armonk, NY, US)
Claim: 1. A method for checkpointing a multiple-processor data processing system in order to provide for error-recovery, said method comprising the steps of: collecting updates of a predetermined maximum number of register contents performed by a respective plurality of CISC or RISC instructions in a checkpoint state buffer, a checkpoint state comprising as many buffering slots as registers can be updated by said plurality of CISC instructions and an entry for a Program Counter value associated with the youngest external instruction of said plurality of CISC instructions; and updating an Architected Register Array (ARA) with currently collected register data after determining that no error was detected in the register data prior or with the completion of said youngest external instruction of said plurality of external instructions.
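The core of claim 1 can be illustrated with a minimal Python sketch. It assumes one buffer slot per register update, four slots per checkpoint state (following the "preferably four entries" of claims 9 and 13), 16 architected registers, and error detection reduced to a boolean flag. The names `CheckpointState`, `Checkpointer`, `collect` and `commit` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CheckpointState:
    """Register updates from a group of instructions plus the PC of the youngest one."""
    max_slots: int = 4  # assumed: one slot per register update
    slots: List[Tuple[int, int]] = field(default_factory=list)  # (register, value) in program order
    pc: Optional[int] = None  # PC associated with the youngest instruction collected so far

    def collect(self, reg_updates: List[Tuple[int, int]], pc: int) -> bool:
        """Add one instruction's register updates; return False if they do not fit."""
        if len(self.slots) + len(reg_updates) > self.max_slots:
            return False  # the caller would start a new checkpoint state
        self.slots.extend(reg_updates)
        self.pc = pc  # this instruction is now the youngest in the state
        return True


class Checkpointer:
    """Holds the Architected Register Array (ARA) and the last error-free restart PC."""

    def __init__(self, num_regs: int = 16) -> None:
        self.ara: List[int] = [0] * num_regs
        self.checkpoint_pc: int = 0

    def commit(self, state: CheckpointState, error_detected: bool) -> bool:
        """Copy a checkpoint state into the ARA only if no error was detected."""
        if error_detected or state.pc is None:
            return False  # discard; the ARA still holds the last error-free checkpoint
        for reg, value in state.slots:  # program order, so a later write to a register wins
            self.ara[reg] = value
        self.checkpoint_pc = state.pc
        return True
```

Because the ARA is only written after the error check passes, recovery in this model amounts to restarting at `checkpoint_pc` with the current ARA contents.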
Claim: 2. The method according to claim 1 further comprising the step of providing Error Detection and correction bits with the ARA entries.
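Claim 2 does not name the error detection and correction code. A common choice for 64-bit register files is a SEC-DED Hamming code (single-error correction, double-error detection), which turns 64 data bits into a 72-bit word. The Python sketch below assumes that choice; `secded_encode` and `secded_decode` are hypothetical helper names.

```python
def _layout(data_bits: int) -> tuple:
    """Return (check_bits, highest_position) for a Hamming code over data_bits."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r, data_bits + r


def _is_data_position(pos: int) -> bool:
    return (pos & (pos - 1)) != 0  # power-of-two positions hold check bits


def secded_encode(data: int, data_bits: int = 64) -> int:
    """Bit 0 holds overall parity; power-of-two positions hold Hamming check bits."""
    r, n = _layout(data_bits)
    word, d = 0, 0
    for pos in range(1, n + 1):  # scatter data bits over the non-check positions
        if _is_data_position(pos):
            word |= ((data >> d) & 1) << pos
            d += 1
    for i in range(r):  # check bit p covers every position whose index has bit p set
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:
                parity ^= (word >> pos) & 1
        word |= parity << p
    return word | (bin(word).count("1") & 1)  # make the total parity even


def secded_decode(word: int, data_bits: int = 64):
    """Return (data, status) with status 'ok', 'corrected' or 'uncorrectable'."""
    _, n = _layout(data_bits)
    syndrome = 0
    for pos in range(1, n + 1):
        if (word >> pos) & 1:
            syndrome ^= pos  # a valid codeword XORs to zero
    odd = bin(word).count("1") & 1
    if syndrome == 0 and not odd:
        status = "ok"
    elif odd:
        word ^= 1 << syndrome  # single-bit error; syndrome 0 means bit 0 itself
        status = "corrected"
    else:
        return None, "uncorrectable"  # even parity with nonzero syndrome: double error
    data, d = 0, 0
    for pos in range(1, n + 1):
        if _is_data_position(pos):
            data |= ((word >> pos) & 1) << d
            d += 1
    return data, status
```

In such a scheme each ARA entry would store the encoded word, so a single flipped bit in a checkpointed register can be repaired on readout during recovery, and a double flip is at least reported.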
Claim: 3. The method according to claim 1 further comprising the steps of: providing in parallel to said ARA update a second control path which controls the release of STORE data resulting from a plurality of STORE instructions from a Store Buffer into an architected state cache memory; synchronizing said STORE data release with said ARA update by tagging said checkpoint state buffer entry with the external instruction ID of the youngest external instruction of said plurality of instructions; and releasing only such data into architected state cache memory which has an older or equal ID than that youngest one.
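The release rule of claim 3 (only STORE data with an ID older than or equal to the youngest checkpointed instruction ID may reach the cache) can be sketched as follows. The sketch assumes instruction IDs that grow monotonically without wraparound and models the architected state cache as a plain dictionary; the class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class StoreBuffer:
    """Holds STORE data until the covering checkpoint has reached the ARA."""

    def __init__(self) -> None:
        # (instruction ID, address, data), kept in program order
        self.pending: Deque[Tuple[int, int, int]] = deque()

    def add(self, instr_id: int, address: int, data: int) -> None:
        self.pending.append((instr_id, address, data))

    def release_up_to(self, youngest_id: int, cache: Dict[int, int]) -> int:
        """Move every store whose ID is older than or equal to youngest_id into the cache."""
        released = 0
        while self.pending and self.pending[0][0] <= youngest_id:
            _, address, data = self.pending.popleft()
            cache[address] = data
            released += 1
        return released

    def discard(self) -> None:
        """On error recovery, drop stores that never became part of a checkpoint."""
        self.pending.clear()
```

Since younger stores stay in the buffer, a recovery to the last checkpoint never has to undo anything already written to the cache.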
Claim: 4. The method according to claim 3 in which the synchronizing step comprises a double handshake operation between an ARA update control and STORE data release control, wherein said double handshake operation comprises the steps of: signaling the youngest external instruction ID to said ARA update control when respective STORE data associated with at least said youngest instruction is residing in said Store Buffer, whereby an ARA update is triggered comprising register instructions having an older ID compared to said signaled youngest external instruction ID; and signaling the youngest external instruction ID associated with the latest ARA update to the STORE data release control thus triggering a STORE data release from the Store Buffer to said architected state cache memory, said release comprising STORE data resulting from instructions having an older ID compared to said signaled youngest external instruction ID.
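One possible reading of the double handshake in claim 4 is shown below as a Python sketch. It assumes monotonically increasing instruction IDs, that `store_ready_id` means STORE data for every instruction up to that ID already sits in the Store Buffer, and that each CSB entry is an error-checked checkpoint state tagged with its youngest ID. "Older" is read as "older or equal", matching claim 3. All function names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


def ara_update(csb: Deque[Tuple[int, Dict[int, int]]], ara: Dict[int, int],
               store_ready_id: int) -> Optional[int]:
    """Signal 1 received: commit CSB states whose youngest ID <= store_ready_id.

    Returns the youngest committed ID (this is signal 2), or None if nothing was committed.
    """
    committed = None
    while csb and csb[0][0] <= store_ready_id:
        youngest_id, updates = csb.popleft()
        ara.update(updates)
        committed = youngest_id
    return committed


def store_release(store_buffer: Deque[Tuple[int, int, int]], cache: Dict[int, int],
                  ara_committed_id: int) -> None:
    """Signal 2 received: release stores whose ID <= the ID of the latest ARA update."""
    while store_buffer and store_buffer[0][0] <= ara_committed_id:
        _, address, data = store_buffer.popleft()
        cache[address] = data


def handshake_cycle(csb, ara, store_buffer, cache, store_ready_id: int) -> Optional[int]:
    """One round: store side -> ARA side, then ARA side -> store side."""
    committed = ara_update(csb, ara, store_ready_id)
    if committed is not None:
        store_release(store_buffer, cache, committed)
    return committed
```

The two signals keep the register state and the memory state at the same checkpoint: the ARA never runs ahead of stores that are still missing, and the cache never runs ahead of the ARA.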
Claim: 5. The method according to claim 1 including collecting updates of a predetermined extended maximum number of register contents performed by a respective complex external instruction in a checkpoint state buffer, comprising the steps of: reserving a respective extended plurality of checkpoint state buffer (60) entries for receiving the register update data; marking subsequent entries being associated with one and the same complex external instruction with a glue bit; and updating the thus extended checkpoint state in an atomic operation in more than one cycle.
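The glue-bit mechanism of claim 5 can be modeled as follows: a complex instruction that updates more registers than one checkpoint state holds spills into further states marked with a glue bit, and the whole glued group is either written completely or not at all. This Python sketch assumes one state is written to the ARA per cycle and that the restart PC advances only after the last glued state; `CsbState`, `split_groups` and `commit_groups` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CsbState:
    updates: List[Tuple[int, int]]  # (register, value) pairs
    pc: int
    glue: bool = False   # True: continues the complex instruction of the previous state
    error: bool = False  # an error was detected in this state's data


def split_groups(states: List[CsbState]) -> List[List[CsbState]]:
    """Join each glued state to the group of the state before it."""
    groups: List[List[CsbState]] = []
    for st in states:
        if st.glue and groups:
            groups[-1].append(st)
        else:
            groups.append([st])
    return groups


def commit_groups(states: List[CsbState], ara: Dict[int, int], checkpoint: Dict[str, int]) -> int:
    """Commit groups atomically; return the number of cycles (one per state) used."""
    cycles = 0
    for group in split_groups(states):
        if any(st.error for st in group):
            break  # whole group dropped; recovery restarts from checkpoint["pc"]
        for st in group:  # assumed: one checkpoint state written per cycle
            for reg, value in st.updates:
                ara[reg] = value
            cycles += 1
        checkpoint["pc"] = group[-1].pc  # recovery point moves only after the whole group
    return cycles
```

Checking the error flags of all glued states before writing any of them is what makes the multi-cycle update atomic in this model.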
Claim: 6. In a system for checkpointing a multiple-processor data processing system in order to provide for error-recovery, a logic circuit comprising: a checkpoint state buffer collecting updates of a predetermined maximum number of register contents performed by a respective plurality of instructions, said checkpoint state buffer comprising as many buffering slots as registers being updated by said plurality of instructions; an entry for a Program Counter value associated with the youngest external instruction of said plurality of instructions; and an Architected Register Array (ARA) updated with currently collected register data after determining that no error was detected in the register data prior or with the completion of said youngest instruction of said plurality of instructions, and further comprising: a Store Buffer; an architected state cache memory; and a second control path in parallel to said ARA update which controls the release of STORE data resulting from a plurality of STORE instructions from said Store Buffer into said architected state cache memory, said STORE data release being synchronized with said ARA update by tagging said checkpoint state buffer entry with the instruction ID of the youngest instruction of said plurality of instructions, and wherein only such data is released into architected state cache memory which has an older or equal ID than that youngest one.
Claim: 7. The logic circuit according to claim 6 wherein said second control path further comprises a double handshake operation during said synchronizing between an ARA update and STORE data release, wherein said double handshake operation comprises: a first signal generator signaling the youngest external instruction ID to said ARA update control when respective STORE data associated with at least said youngest instruction is residing in said Store Buffer, whereby an ARA update is triggered comprising register instructions having an older ID compared to said signaled youngest external instruction ID; and a second signal generator signaling the youngest external instruction ID associated with the latest ARA update to the STORE data release control thus triggering a STORE data release from the Store Buffer to said architected state cache memory, said release comprising STORE data resulting from instructions having an older ID compared to said signaled youngest external instruction ID.
Claim: 8. In a system for checkpointing a multiple-processor data processing system in order to provide for error-recovery, a logic circuit comprising: a checkpoint state buffer collecting updates of a predetermined maximum number of register contents performed by a respective plurality of instructions, said checkpoint state buffer comprising as many buffering slots as registers being updated by said plurality of instructions; an entry for a Program Counter value associated with the youngest external instruction of said plurality of instructions; and an Architected Register Array (ARA) updated with currently collected register data after determining that no error was detected in the register data prior or with the completion of said youngest instruction of said plurality of instructions, further comprising: a checkpoint state buffer having multiple entries; a complex external instruction collecting updates of a predetermined extended maximum number of register contents in said checkpoint state buffer, wherein said complex external instruction: reserves a respective extended plurality of checkpoint state buffer entries for receiving the register update data; marks subsequent entries being associated with one and the same complex external instruction with a glue bit; and updates the thus extended checkpoint state in an atomic operation in more than one cycle.
Claim: 9. The logic circuit according to claim 8 wherein said checkpoint state buffer comprises a plurality of buffer entries, each of which comprises an instruction ID, a target register address, target register data, and a program counter, whereby a checkpoint state covers a plurality of preferably four entries.
Claim: 10. A data processing system comprising: multiple processors; a checkpointing logic circuit providing for error-recovery; a checkpoint state buffer in said checkpointing logic circuit collecting updates of a predetermined maximum number of register contents performed by a respective plurality of instructions, said checkpoint state buffer comprising as many buffering slots as registers being updated by said plurality of instructions; an entry for a Program Counter value associated with the youngest external instruction of said plurality of instructions; and an Architected Register Array (ARA) updated with currently collected register data after determining that no error was detected in the register data prior or with the completion of said youngest instruction of said plurality of instructions, further comprising: a Store Buffer; an architected state cache memory; and a second control path in parallel to said ARA update which controls the release of STORE data resulting from a plurality of STORE instructions from said Store Buffer into said architected state cache memory, said STORE data release being synchronized with said ARA update by tagging said checkpoint state buffer entry with the instruction ID of the youngest instruction of said plurality of instructions, and wherein only such data is released into architected state cache memory which has an older or equal ID than that youngest one.
Claim: 11. The data processing system according to claim 10 wherein said second control path further comprises a double handshake operation during said synchronizing between an ARA update and STORE data release, wherein said double handshake operation comprises: a first signal generator signaling the youngest external instruction ID to said ARA update control when respective STORE data associated with at least said youngest instruction is residing in said Store Buffer, whereby an ARA update is triggered comprising register instructions having an older ID compared to said signaled youngest external instruction ID; and a second signal generator signaling the youngest external instruction ID associated with the latest ARA update to the STORE data release control thus triggering a STORE data release from the Store Buffer to said architected state cache memory, said release comprising STORE data resulting from instructions having an older ID compared to said signaled youngest external instruction ID.
Claim: 12. The data processing system according to claim 10 further comprising: a checkpoint state buffer having multiple entries; a complex external instruction collecting updates of a predetermined extended maximum number of register contents in said checkpoint state buffer, wherein said complex external instruction: reserves a respective extended plurality of checkpoint state buffer entries for receiving the register update data; marks subsequent entries being associated with one and the same complex external instruction with a glue bit; and updates the thus extended checkpoint state in an atomic operation in more than one cycle.
Claim: 13. The data processing system according to claim 12 wherein said checkpoint state buffer comprises a plurality of buffer entries, each of which comprises an instruction ID, a target register address, target register data, and a program counter, whereby a checkpoint state covers a plurality of preferably four entries.
Current U.S. Class: 714/15
Patent References Cited: 6115730 September 2000 Dhablania et al.
6581155 June 2003 Lohman et al.
6785842 August 2004 Zumkehr et al.
6810489 October 2004 Zhang et al.
2004/0221212 November 2004 Ando
Assistant Examiner: Duncan, Marc
Primary Examiner: Beausoliel, Robert
Attorney, Agent or Firm: Augspurger, Lynn L.
Accession Number: edspgr.06968476
Database: USPTO Patent Grants
Language: English