UTeM student computer organization and architecture

Monday, December 16, 2013

Boolean Equation

Boolean equation can be represented in 2 forms:

Sum of products (SOP)

i. Combination of input values that produce 1s is converting into equivalent variables.

ii. AND together then OR with other combination variables with the same output.

iii. SOP is easier to derive in truth table compare to POS.

Product of sums (POS)

i. Combination of input values that produce 0s is converting into equivalent variables.

ii. OR together then AND with other combination variables with the same output.

iii. POS usually use when more 1s produce in output function.

Example:

Sum of products (SOP):

Y = A’B’C + A’BC’ + AB’C’ + ABC

Product of sums (POS):

Y = A’B’C’ + A’BC + AB’C +ABC’

Sunday, December 15, 2013

Measuring Cache Performance

Measuring Cache Performance

There are two different techniques for improving cache performance.

i. One focuses on reducing the miss rate by reducing the probability that

2 different memory blocks will contend for the same cache location.

ii. The second technique reduces the miss penalty by adding an additional level to the

hierarchy.

CPU time can divide into the clock cycles that the CPU spends executing

the program and the clock cycles that the CPU spends waiting for the memory

system. We assume that the costs of cache accesses that are hits are part

of the CPU execution cycles. So,

CPU time= (CPU execution clock cycles+ Memory stall clock cycles)

x Clock cycle time

The memory stall clock cycles come mainly from cache misses,

In real processors, the stalls generated by reads and writes can be

quite complex, and accurate performance prediction usually requires

detailed simulations of the processor and the memory system.

Memory-stall clock cycles can be defined as the sum of the staU cycles coming

from reads plus those coming from writes:

Memory stall cycles = (Memory accesses / Program) x Miss rate x Miss penalty

We can also factor this as :

Memory stall clock cycles = (Instructions / Program) x (Misses / Instruction) x Miss penalty

Cache Performance Example

Assume an instruction cache miss rate for a program is 2% and a data cache miss rate is 4%. If a processor has a CPI of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine how much faster a processor would run with a perfect cache that never missed.

solution:

Let I =instruction count

Instruction miss cycles = I X 2% X 100 = 2.00 X I

Data miss cycles = I x 36% x 4% x 100 = 1.44 X I

The total number of memory stall cycles is 2.00 I + 1.44 I =3.44 I. This is more than 3 cycles of memory stall per instruction. Accordingly, the CPI with memory stalls is 2 + 3.44 =5.44. Since there is no change in instruction count or clock rate, the ratio of the CPU execution times is

(CPU time with stalls / CPU time with perfect cache = ( I X CPIstall X Clock cycle / I X CPIperfect X Clock cycle)

CPIstall / CPIperfect = 5.44/2

= 2.72

The performance with the perfect cache is better by 2.72

Performance summary

i. When CPU performance increase, miss penalty becomes more significant.

ii. Decreasing base CPI will make the proportion of time spent on memory stalls more greater.

iii. Increasing clock rate, memory stalls account for more CPU cycles.

iv. Cache behavior cannot be neglected when evaluating system performance.

Associative Caches

Fully associative

1. A block in memory may be associated with any entry in the cache.

2. To find a given block in a fully associative cache, all the entries in the cache must be searched because a block can be placed in any one.

3. To make the search practical, it is done in parallel with a comparator associated with each cache entry. These comparators significantly increase the hardware cost, effectively making fully associative placement practical only for caches with small numbers of blocks.

n-way set associative

1. Consists of a number of sets, each of which consists of n blocks.

2. Each block in the memory maps to a unique set in the cache given by the index field, and a block can be placed in any element of that set.

3. Remember that in a direct-mapped cache, the position of a memory block is given by :

(Block number) modulo (Number of cache blocks)

4. In a set-associative cache, the set containing a memory block is given by :

(Block number) modulo (Number of sets in the cache)

5. N comparators is less expensive compared to fully associative.

Associative Cache Example

For a cache with 8 entries

Compare 4-block caches

i. Direct mapped

ii. 2-way set associative

iii. Fully associative

Block access sequence: 0, 8, 0, 6, 8

Direct mapped & 2-way set associative

Set Associative Cache Organization

Thursday, December 12, 2013

FLOATING POINT

THE PROCESSOR

The Processor

1. An overview of the implementation of instructions.

The memory-reference instructions: load word (lw) and store (sw).

The arithmetic-logical instructions: add, sub, and, or, slt, etc.

The instructions: branch equal (beq) and jump (j).

For every instruction, the first two steps are identical:

Send the program counter (PC) to the memory that contains the code and fetch the instruction from that memory.
Read one or two registers, using fields of the instruction to select the registers to read. For the load word instruction, we need read only one register, but most other instructions require that we read two registers.

After these two steps, the actions required to complete the instruction depend on the instruction class.

For example:

I. All instruction classes, except jump, use the arithmetic-logical unit (ALU) after reading the registers.

II. The memory-reference instructions use the ALU for an address calculation, the arithmetic-logical instructions for the operation execution, and branches for comparison.

III. Memory reference will need to access the memory either to write data or read data for a load.

IV. Arithmetic-logical instruction must write the data from the ALU back into register.

V. For branch instructions, we may need to change the next instruction.

An abstract view of the implementation of the MIPS subset showing the major functional units and and the major connection between them.

In several places, there are multiple connections going to a particular unit as coming from two different sources. To solve this problem, multiplexor (MUX) aka data selector is used to select from among several inputs based on the setting.

TOP MULTIPLEXOR: Controls what value replaces the PC (PC+4)

MIDDLE MULTIPLEXOR: Indicate that is a branch

BOTTOM MULTIPLEXOR: To determine whether the second ALU input is from the register.

2. Logic Design Conventions

There are two different types of logic elements:

a. Elements that operate on data values: Combinational Elements.

b. Elements that operate on state: State Elements.

· Combinational Elements

Given the same input. A combinational element always produces the same output.

· State Elements

Has at least two inputs and one output. The required inputs are the data value to be written in the element and the clock, which determines when the data is written. The output provides the value that was written in an earlier clock cycle.

· Clocking Methodology

A clocking methodology defines when signals can be read and when they can be written.

i) Edge-triggered clocking methodology

Any values stored in a sequential logic element are updated only on a clock edge. This allows us to read the contents of a register, sent the value through some combinational logic, and write that register in the same clock cycle.

ii) Control Signal

We do not show a write control signal when a state element is written on every active clock edge. If a state element is not updated on every clock, then an explicit write control signal is required.

Nearly all these state and logic elements will have inputs and outputs that are 32 bit wide since that is the width of most of the data handled by the processor.

3. Building A Datapath

Let’s start of by looking at the datapath elements:

· State Element:

Program Counter (PC) is used to hold the address of the current instruction.

Memory unit is to store the instructions of a program and supply instructions given an address.

· Adder:

· This adder is combinational, can be built from the ALU simply by wiring the control lines so that the control always specifies an add operation.

To execute any instruction, we must start by fetching the instruction from memory. To prepare for executing the next instruction, we must also increment the program counter so that it points at the next instruction, 4 bytes later.

· Register File:

A register file is a collection of registers in which any register can be read or write by specifying the number of the register in the file. The register file contains the register state of the machine. In addition, we will need an ALU to operate on the values read from the registers.

Because the R-format instructions have three register operands, we will need to read two data words fro the register file and write one data word into the register file for each instruction.

For each data word read from register, we need an input to the register file that specifies the register number to be read and an output from the register file that will carry the value that has been read from the registers.

· ALU:

The operation to be performed by ALU is controlled with the ALU operation signal, which will be 32 bits for input and output, 1 bit output if the result is zero.

4. Pipelining

Pipelining is an implementation technique in which multiple instructions are overlapped in execution.

According to the picture above, we use laundry analogy for pipelining. This picture shows that to complete a cycle of laundry it needs 1.5hours. To complete all four, it will be 1.5 x 4 = 6 hours of work which is too long. Therefore, pipelining process is required.

This picture shows the process and advantage of using pipelining. The advantage is that now the required is less than before which is only 3 hours, half of the original.

We can turn the pipelining speedup discussion above into a formula. If the stages are perfectly balanced, then the time between instructions on the pipelined processor is:

Time between instructions (nonpipelined)

Time between instructions (pipelined) = ___________________________________________

Number of pipe stages

· Pipeline Hazards

There are situations in pipelining when the next instruction cannot execute in the following clock cycle, these called hazards.

1. Structural Hazards

The hardware cannot support the combination of instructions that we want to execute in the same clock cycle.

2. Data Hazards

Occur when the pipeline must be stalled because one step must wait for another to complete. Data hazard arise from the dependence of one instruction on an earlier one that is still in the pipeline.

To solve this, forwarding or bypassing is needed. This is a method of resolving a data hazard by retrieving the missing data element from internal buffers rather than waiting for it to arrive from programmer-visible registers or memory.

3. Control Hazards

Arising from the need to make a decision based on the results of one instruction while others are executing.

There are two solutions to that:

· Stall

Just operate sequentially until the first instruction is done and then repeat until you have the right formula. This conservative option certainly works, but it is slow.

· Predict

Always to predict that branches will be untaken. When you are right, the pipeline proceed with full speed. Only then branches are taken does the pipeline stall.

Saturday, December 7, 2013

INTRODUCTION OF MULTIMEDIA LOGIC

MULTIMEDIA Logic is a drag and drop logic design software. You can draw your logic design on the white canvas and then we can simulate the circuit.

STEP 1

To draw the circuit, you will need the palette to choose which component need to build that circuit.

The mouse pointer indicates as selector, use this selector to select or move icon.

STEP 2

Find logic AND int the palette and click the AND gate and place it (click) on the canvas.

STEP 3

Find the switch and click the switch, place on the canvas near the input of the gate.

The switch would look like the figure below.

The purpose of using the switch is to test the truth table. If the switch is flipped up, it means 1 and 0 if it is flipped down. Look at the figure below.

STEP 4

Find the LED icon and place on the output side of the gate. Do not confuse with the 7 LED Segment !!

To connect each component, use wire icon to connect each dot to the respective gate.

Click at dot end and drag the line to the other dot end.

STEP 5

Label the input and output accordingly. Find icon capital A (that refer to Text label) , and click next to the input and output. To edit the text, double click at the text to open Text properties.

Type the input label on Text box. Click OK.

It will show like figure below.

STEP 6

Next, simulate the logic by clicking the Simulate menu, choose Run.

Or you can just click the play button to run the simulation.

STEP 7

Test the AND truth table by clicking the switch to see whether LED will light on. LED will lights on if the condition met. In the logic, only when all switches is set to 1 then the LED will light RED.

To stop the simulation, press Stop button. This circuit can be save into .Igi file or printed by using the file menu.

Post by Nur Amelina Hassan B031310041

MEMORY ORGANIZATION : CACHE MEMORY

Memory Organization : Cache Memory

WHAT DO WE WANT IN A MEMORY?

	Capacity	Latency	Cost
Register	1000’s of bits	10 ps	$$$$
SRAM	1-4 Mbytes	0.2 ns	$$$
DRAM	1-4 Gbyes	5 m	$
Hard disk*	100’s Gbytes	10 ms
Want?	2-10 Gbytes	0.2ns	Cheap!

Key Idea

Key Idea : Exploit “ Principle of Locality “

Keep data used often in a small fast SRAM

· called “CACHE”, often on the same chip as the CPU

Keep all data in bigger but slower DRAM

· called “Main Memory” , usually separate chip

Access Main Memory only rarely, for remaining data

The reason this strategy works : LOCALITY

· if you access someting now, you will likely access it again(or its neighbours) soon

CACHE ANALOGY

You are writing a term paper for your history class at a table in the library

· As you work you realize you need a book

· You stop writing, fetch the reference, continue writing

· You don’t immediately return the book, maybe you’ll need it again

· Soon you have a few books at your table, and you can work smoothly without needing to fetch more books from the shelves

· The table is a CACHE for the rest of the library

Now you switch to doing your biology homework

· You need to fetch your biology textbook from the shelf

· If your table is full, you need to return one of the history books back to the shelf to make room for the biology book

TYPICAL MEMORY REFERENCE PATTERNS

* Memory Trace

* Temporal Locality

* Spatial Locality

**Memory Trace – A temporal sequence of memory references (addresses) from a real program.

**Temporal Locality – If an item is referenced, nearby items will tend to be referenced again soon.

**Spatial Locality – If an item is referenced, nearby items will tend to be referenced soon.

Exploiting the Memory Hierarchy

= Approach 1 ( Cray, others): Expose Hierarchy

Registers, main memory, disk each available as explicit storage alternatives

Tell programmers : "Use them cleverly"

= Approach 2 : Hide Hierarchy

Programming model : SINGLE kind of memory, single address space.

Transparent to programmer : Machine AUTOMATICALLY assigns locations, depending on runtime usage patterns.

EXPLOITING THE MEMORY HIERARCHY

CPU speed is dominated by memory performance

* More significant than : ISA, circuit optimization, pipelining, etc.

TRICK #1 : Make slow MAIN MEMORY appear faster

* Technique : CACHING

TRICK #2 : Make small MAIN MEMORY appear bigger

* Technique : VIRTUAL MEMORY

MEMORY HIERARCHY LEVELS

Block (aka line) : unit of copying

** May be multiple words

If accessed data is present in upper level

** Hit : access satistified by upper level

* Hit ratio : hits/accesses

If accessed data is absent

**Miss : block copied from lower level

*Time taken : miss penalty

*Miss ration : misses/accesses

= 1 - hit radio

**Then accessed data supplied from upper level

Cache Memory

** Cache memory

-The level of the memory hierarchy closest to the CPU

Given accesses X,....., Xn-1, Xn

DIRECT MAPPED CACHE

Location determined by address

Direct mapped : only one choice

* #line is a power of 2

* Use low-order address bits

TAGS AND VALID BITS

How do we know which particular block is stored in a cache location?

* Store block address as well as the data

* Actually only need the hight order bits

* Called the tag

What if there is no data in a location?

* Valid bit : 1 = present, 0 = not present

* Initially 0

Cache Example

** 8-blocks, 1 word/block, direct mapped

** Initial state

Cache Example

Post by Nur Amelina Hassan B031310041