29 November 2020

# ARMs PL011 UART

by Mike Krinkin

In the previous post we managed to try our simplistic EFI loader on 64-bit ARM in QEMU and on HiKey960 board. Now I want to try to explore the aarch64 architecture a bit further and maybe create something interesting worth loading on the board.

Since the previous post I’ve made a few changes to the EFI loader, you can find them on GitHub. The sources for this post are also available on GitHub, but in a different repository.

# UART

I think that the first step we need to take when exploring a new device is to figure out how to communicate with it. Depending on the specific board you have you may have different options available, I decided to go with the serial port.

I will work with the virt board emulated by QEMU and when possible I will check my code on the HiKey960 board that I have. Both of them as far as I can tell have PL011 UART.

PL011 UART was either designed or licensed by ARM and seem to be quite common and PL011 specification can be freely downloaded on the ARM site.

What is UART? UART is commonly referring to a serial communication protocol and hardware that implements it. I personally don’t really understand what UART actually is as there doesn’t seem to be a specification defining it.

NOTE: normally I would be very much conserned about it, since for any means of communications it’s important that communicating parties had compatible implementations on physical and logical levels. It’s hard to achieve this compatibility if there is no common specification. However, all the UART controllers I had so far seem to be able to communicate just fine.

However there are some general things that I know. For example, UART is serial. It means that data is trasnferred bit by bit. UART also appear to be full duplex in general. It means that the data may be transferred in both directions at the same time.

More importantly however UART might support different data frame formats and may operate with different speeds. For successful communication receiver and transmitter should use the same data format and operate on a similar speed.

UART data frame seem to transfer one character at a time. The size of the character is configurable to some extent on the modern hardware. For example, PL011 supports 5, 6, 7 and 8 bit long characters (see Chapter 3 Programmers Model, section 3.3.7 Line Control Register, UARTLCR_H in the PL011 specification).

Together with the data it may be possible to include a parity bit that could be used to check for data corruptions on the channel. There are multiple configurations of parity checks available (check even, check odd and even something called sticky check).

After data and parity bits goes stop signal. Interestingly enough there are several different configurations available for the stop signal: 1 stop bit, 2 stop bits and 1.5 stop bits (whatever it means).

In addition to the data, partity and stop bits UART frame contains a start bit that indicates to the receiver that a data transmission is happening. This part doesn’t seem to be configurable, so we are not particularly interested in it.

From personal experience, the commonly used configuration of the data frame consists of 8 data bits, no parity bit and 1 stop bit.

NOTE: working with QEMU it doesn’t really matter how UART frame looks like since there is no actual physical communication involved. With the real hardware it might matter, but it appears to me that configurations used in various hobby projects more or less converge.

The operating speed of UART is commonly refered to as baudrate. Baudrate is also known as symbol rate. In practice for a chanel that can transfer only two kinds of symbols (0 and 1, low and high, etc) the baudrate is just number of bits per second.

When configuring my UART I will be using the following parameters:

1. data frame with 8 data bits, no parity bit and 1 stop bit
2. baudrate of 115200.

# PL011

PL011 is just one of hardware implementation that could be used for UART. And with many others it’s a bit complicated because there a multiple parameters to configure.

As was disucssed in the previous section we need to at least configure the data frame forma and the speed. However with hardware devices the story doesn’t end there.

For example, PL011 has some receive/transmit buffers (FIFOs). So that you can queue multiple characters to be transmitted at the same time and then let the device send them out while the CPU can do other stuff. If you want to use this functionoality it should be configured.

It’s also possible to configure what notifications do you want to get from the device and when. For example, we may ask PL011 to generate an interrupt, when it’s done transmitting all the characters in the FIFO. Again, that should be configured.

NOTE: PL011 specification is actually quite compact by specification standards, but, as with many other specifications, getting a complete picture just reading through it is rather hard.

I just need the simplest way to communicate with the device for some debug printing. So the simplest configuration that achives the goal is good for me. With that in mind here are my goals:

• one directional communication: we only need to send data from the device,
• no interrups, use polling to interact with the device.

So what I need to find out is how to configure the data frame format I want and the speed of the device. In addition to that I need to find out how to check that character transmission is complete and we can send the next character. As for the rest, I will take disabling everything I don’t understand/don’t need as a guding principle when reading through the documentation.

PL011 specification describes a few registers, here is the list of registers that I found relevant for the configuration:

• UARTDR - writing to this register initiates the actual data transmission
• UARTFR - reading this register we can check if transmission is complete
• UARTIBRD and UARTFBRD are responsible for the speed
• UARTLCR_H controls the data frame format.

Besides the registers above I also noted the following registers that might require configuring:

• UARCR - we can enable/disable UART, transmission or receiption via this register, so we might want to program it to make sure that UART is enabled
• UARTIMSC - allows to mask interrupts, so we might want to program it to make sure that all the interrupts are disabled
• UARTDMACR - controls DMA, since I don’t want to figure out how DMA is involved here I will program it to make sure that DMA is disabled.

# Programming PL011 registers

The specification seem to imply that all the PL011 registers are memory mapped. That means that we just need to read/write the right memory address to access the registers.

In the chapter 3 Programmers Model, section 3.1 About the programmers model, PL011 specification says that the base address of the memory where the registers are mapped is not fixed, but the offset of the registers from the base address is fixed.

For the registers outlined above the offsets are:

Register Name Offset
UARTDR 0x000
UARTFR 0x018
UARTIBRD 0x024
UARTFBRD 0x028
UARTLCR_H 0x02c
UARTCR 0x030
UARTIMSC 0x038
UARTDMACR 0x048

We can go ahead and put that in the code:

#include <stdint.h>  // needed for uint32_t type

static const uint32_t DR_OFFSET = 0x000;
static const uint32_t FR_OFFSET = 0x018;
static const uint32_t IBRD_OFFSET = 0x024;
static const uint32_t FBRD_OFFSET = 0x028;
static const uint32_t LCR_OFFSET = 0x02c;
static const uint32_t CR_OFFSET = 0x030;
static const uint32_t IMSC_OFFSET = 0x038;
static const uint32_t DMACR_OFFSET = 0x048;


We don’t know the actual addresses in memory for those registers since the specification doesn’t provide it, so the base address will become a parameter that we will figure out later. For now, let’s just introduce a structure that will hold the base address:

#include <stdint.h>  // needed for uint64_t type

struct pl011 {
};


NOTE: we will add more fields into the structure later.

Let’s assume that we know the base address and the offset, how do we actually access the registers?

Here I was quite confused by the specification. The specification when it describes the size of the registers specifies the number of bits. The confusing part is that the number of bits is not always a multiple of 8. For example, the UARTFR register is 9 bit long. How can we access a registers that is 9 bit long on a platform that only allows 8 bit accesses minimum?

Turns out that all the registers are actually 32 bits long and the bits we need are the lower bits of those 32. We just have to be careful to avoid writing anything but 0s to the unused bits.

So here is the code that I created to access the registers:

#include <stdint.h>  // needed for uint32_t type

volatile uint32_t *reg(const struct pl011 *dev, uint32_t offset)
{

}


Using this function we can read UARTCR this way:

// dev here is a pointer to struct pl011
uint32_t cr = *reg(dev, CR_OFFSET);


NOTE: volatile part here is to make accesses to memory a part of the observable behavior of the program, so that compiler cannot optimize it away because it thinks that the result is not used or that the result might be cached in a register.

# Calculating baudrate divisiors

PL011 is a bit weird because it has two registers that are used to control the speed of UART: UARTIBRD and UARTFBRD. Both registers together store a value that would be used to divide the base clock frequency. The UARTIBRD stores the integer part and UARTFBRD stores the fractional part.

The actual implementation is, of course, up to the hardware, but one easy way to think about it is this. Let’s assume that we have a clock that generate a pulse once a second.

We can send the clock through a scheme that would count the number of pulses and let through every Nth pulse and hide all other pulses. This way we’d get a clock that generates a pulse once avery N seconds. So we divided the frequency of the original clock by N. Basically UARTIBRD and UARTFBRD together store that N.

UARTIBRD is limited to 16 bits and UARTFBRD is 6 bits. The fractional part apparently uses the regular binary encoding, in other words, the fractional parts stores the number of $$1/2^6 = 1/64$$ parts.

The [PL011] specification also provides equations to calculate the desirable divisor value:

$D = {F_{UARTCLK} \over 16 \times B}$

where $$D$$ is the calculated divider value, $$B$$ is the desired baudrate, $$F_{UARTCLK}$$ is the base clock frequency.

The desired baudrate is 115200 as was mentioned above, but what is the base clock frequency?

The specification doesn’t actually tell us the base clock frequency. Supposedly, as with the base address, it might change from hardware to hardware. For now let’s add it to our structure as a paramter together with the desirable baudrate:

#include <stdint.h>  // needed for uint64_t and uint32_t types

struct pl011 {
uint64_t base_clock;
uint32_t baudrate;
};


To actually calculate the values for UARTIBRD and UARTFBRD let’s notice that we don’t need to work with integer and fractional parts separately. We can calculate the value of $$64 \time D$$ instead. Then the lower 6 bits of the result will be our fractional part and the higher 16 bits will be our integer part:

#include <stdint.h>  // needed for uint32_t type

static void calculate_divisiors(
const struct pl011 *dev, uint32_t *integer, uint32_t *fractional)
{
// 64 * F_UARTCLK / (16 * B) = 4 * F_UARTCLK / B
const uint32_t div = 4 * dev->base_clock / dev->baudrate;

*fractional = div & 0x3f;
*integer = (div >> 6) & 0xffff;
}


NOTE: the calculations above aren’t absolutely accurate and allow some rounding errors. You can multiply the $$F_{UARTCLK}$$ by a higher number to calculate the baudrate more accurately and drop the bits you don’t need. However I don’t strive here for perfect accuracy, so this should be good enough.

# Waiting for transmission to complete

When writing to the PL011 we need to make sure that all the previous characters were successfully sent out. So we need to find out how to wait for the transmission to complete.

Also, as I outlined above that I don’t want to use interrupts (for now) we should be able to do that by just polling the device.

In order to do that we can read from the UARTFR register. Specifically, according to the specification, the BUSY bit of UARTFR will be set to 1 if UART is currently busy transmitting data.

So we can wait for the transmission to complete this way:

#include <stdint.h>  // needed for uint32_t type

static const uint32_t FR_BUSY = (1 << 3);

static void wait_tx_complete(const struct pl011 *dev)
{
while ((*reg(dev, FR_OFFSET) * FR_BUSY) != 0) {}
}


# Configuring PL011

Now let’s try to put everything together. I’ll start by adding a few more parameters to our structure that define the data format to avoid spreading magic number through the code:

#include <stdint.h>  // needed for uint64_t and uint32_t types

struct pl011 {
uint64_t base_clock;
uint32_t baudrate;
uint32_t data_bits;
uint32_t stop_bits;
};


I will also create a function to initialize the structure and the device:

int pl011_setup(struct pl011 *dev, uint64_t base_address, uint64_t base_clock)
{
dev->base_clock = base_clock;

dev->baudrate = 115200;
dev->data_bits = 8;
dev->stop_bits = 1;
return pl011_reset(dev);
}


The function takes the pointer to the structure describing the device and two parameters that have to be provided from the outside: base address and base clock frequency. The rest of the parameters are initialized to fixed values.

pl011_reset function will do all the heavy lifting and will actually configure the device by writing to the registers. However before we take a look at the pl011_reset please notice of the notes in the section 3.3.8 Control Register, UARTCR of the PL011 specification. One of the notices outlines the proper way of programming the UARTCR register:

1. We first need to disable UART (see UARTEN bit of UARTCR)
2. Then we need to wait for transmission of the current character to complete.
3. Flush the FIFO queues (see FEN bit of UARTLCR_H)
4. Write the desired values to the UARTCR regster
5. Enable UART (again, see UARTEN bit of UARTCR).

NOTE: enabling and disabling UART both actually involves writing UARTCR, so it’s not immediately clear if steps 4 and 5 need to be two separate steps. I don’t want to spend too much time to figure out what it’s, so I’m going to assume for now that writing to UARTEN bit of UARTCR register should be separated from changing values of all other bits.

Keeping this procedure in mind and remembering that we also want to program a few other registers (UARTLCR_H, UARTIBRD, UARTFBRD, UARTIMSC and UARTDMACR) this is how pl011_reset might look like:

#include <stdint.h>  // needed for uint32_t type

static const uint32_t CR_TXEN = (1 << 8);
static const uint32_t CR_UARTEN = (1 << 0);

static const uint32_t LCR_FEN = (1 << 4);
static const uint32_t LCR_STP2 = (1 << 3);

int pl011_reset(const struct pl011 *dev)
{
uint32_t cr = *reg(dev, CR_OFFSET);
uint32_t lcr = *reg(dev, LCR_OFFSET);
uint32_t ibrd, fbrd;

// Disable UART before anything else
*reg(dev, CR_OFFSET) = (cr & CR_UARTEN);

// Wait for any ongoing transmissions to complete
wait_tx_complete(dev);

// Flush FIFOs
*reg(dev, LCR_OFFSET) = (lcr & ~LCR_FEN);

// Set frequency divisors (UARTIBRD and UARTFBRD) to configure the speed
claculate_divisors(dev, &ibrd, &fbrd);
*reg(dev, IBRD_OFFSET) = ibrd;
*reg(dev, FBRD_OFFSET) = fbrd;

// Configure data frame format according to the parameters (UARTLCR_H).
// We don't actually use all the possibilities, so this part of the code
// can be simplified.
lcr = 0x0
// WLEN part of UARTLCR_H, you can check that this calculation does the
// right thing for yourself
lcr |= ((dev->data_bits - 1) & 0x3) << 5;
// Configure the number of stop bits
if (dev->stop_bits == 2)
lcr |= LCR_STP2;

// Mask all interrupts by setting corresponding bits to 1
*reg(dev, IMSC_OFFSET) = 0x7ff;

// Disable DMA by setting all bits to 0
*reg(dev, DMACR_OFFSET) = 0x0;

// I only need transmission, so that's the only thing I enabled.
*reg(dev, CR_OFFSET) = CR_TXEN;

// Finally enable UART
*reg(dev, CR_OFFSET) = CR_TXEN | CR_UARTEN;

return 0;
}


# Sending data

The final piece of the puzzle is how can we actually send the data. Sending the data is easy, we can just write to UARTDR reigster and wait until the data is transmitted:

#include <stddef.h>  // needed for size_t type

int pl011_send(const struct pl011 *dev, const char *data, size_t size)
{
// make sure that there is no outstanding transfer just in case
wait_tx_complete(dev);

for (size_t i = 0; i < size; ++i) {
if (data[i] == '\n') {
*reg(dev, DR_OFFSET) = '\r';
wait_tx_complete(dev);
}
*reg(dev, DR_OFFSET) = data[i];
wait_tx_complete(dev);
}

return 0;
}


NOTE: in Unix world \n is normally treated as \r\n by text editors and virtual terminals, but it appears that it’s not always the case. So I’m adding \r here before \n just in case.

# Testing in QEMU

Before we can test anything in QEMU we need to fill a couple of gaps. There are two parameters that we don’t know yet: base address and the base clock frequency. Those parameters may change from platform to platform. So we need to figure out what are the values of those parameters in QEMU.

Fortunately QEMU allows us to dump the Device Tree describing the platform they simulate.

NOTE: if you are not familiar with Device Tree at all you can refer to the chapter 1 Introduction, section Purpose and Scope of the Device Tree specification that explains what problem this specification is trying to solve (it’s available on GitHub). For this post complete understanding of the Device Tree is not needed, so I’m not covering it.

To get the binary dump of the device tree you can run the following command:

qemu-system-aarch64 -machine virt,dumpdtb=virt.dtb -cpu max


It will produce the dump in the file virt.dtb of the current durectory. While we at it, virt in this case is a name of the platform that QEMU emulate. As you may have guessed from the name, it’s not a real hardware platform.

Binary dump of the device tree is not human readable, so we have to convert it into a human readable format. We can use Device Tree compiler to do that:

sudo apt-get install device-tree-compiler
dtc -I dtb -O dts virt.dtb


This command will produce the Device Tree in a human readable format as output. Now, you may not be familiar with the format even in a human readable form, so let me post here the relevant parts and what I derived from them:

...

pl011@9000000 {
clock-names = "uartclk\0apb_pclk";
clocks = <0x8000 0x8000>;
interrupts = <0x00 0x01 0x04>;
reg = <0x00 0x9000000 0x00 0x1000>;
compatible = "arm,pl011\0arm,primecell";
};

...

apb-pclk {
phandle = <0x8000>;
clock-output-names = "clk24mhz";
clock-frequency = <0x16e3600>;
#clock-cells = <0x00>;
compatible = "fixed-clock";
};

...


Those two parts are responsible for the PL011 controller and the clock used by that controller. I’m sure you can already spot the base clock frequency from that snippet - it’s 24MHz.

The base address for the PL011 registers is 0x9000000 and this information comes from the reg property of the pl011@9000000 node.

NOTE: normally OS kernel should read the Device Tree directly instead of us dumping the Device Tree and hardcoding the parameters we found there in the code, but that would require us to understand the Device Tree in greater details than we do now, so I’m going for a quick and hacky solution for now.

With that information available we can write our entry point:

#include "pl011.h"  // where PL011 related declarations live

void main(void)
{
struct pl011 serial;

pl011_setup(
&serial /* base_address = */0x9000000, /* base_clock = */24000000);
pl011_send(&serial, "Hello World\n", sizeof("Hello, World\n");

// There is nowhere to exit, so just hang here
while (1) {}
}


Now we can build everything. In my case the project includes just 3 files:

1. pl011.h - declarations for the functions interacting with PL011
2. pl011.c - implementations of the functions interact with PL011
3. main.c - where the main lives.

We need to build those in free standing environment, for aarch64 architecture and specify that main function is our entry point.

I’m using LLVM based toolchain (clang as compiler and lld as a linker), so to cross compile all I need to do is to specify the target to the compiler. I’m using aarch64-unknown-eabi as my target (Embedded ABI, no specific vendor).

So here is how my Makefile looks:

CC := clang
LD := lld

CFLAGS := \
-ffreestanding -MMD -mno-red-zone -std=c11 \
-target aarch64-unknown-eabi -Wall -Werror -pedantic
LDFLAGS := \
-flavor ld -e main

SRCS := pl011.c main.c

default: all

%.o: %.c
$(CC)$(CFLAGS) -c $< -o$@

kernel.elf: pl011.o main.o
$(LD)$(LDFLAGS) $^ -o$@

-include \$(SRCS:.c=.d)

.PHONY: clean all default


If everything bulds correctly, after running make you should get kernel.elf file suitable to be loaded by the EFI loader from github.com/krinkinmu/efi.

If you build the loader from github.com/krinkinmu/efi you should have boot.efi file - which is UEFI based loader for ELF files.

Finally, if you don’t have it already, you need UEFI firmware for QEMU itself:

sudo apt-get install qemu-efi-aarch64


Now when we have everything let’s setup our working directory:

mkdir -p ~/ws/aarch64
cd ~/ws/aarch64
cp /usr/share/qemu-efi-aarch64/QEMU_EFI.fd OVMF.fd
truncate -s 64m OVMF.fd
mkdir -p root/efi/boot


NOTE: ~/ws/aarch64 will be my working directory through this cycle of posts. NOTE: ~/ws/aarch64/root will store the content of the filesystem that will be available to the virtual machine.

Copy boot.efi inside to the ~/ws/aarch64/root/efi/boot/boot.efi file and kernel.elf to the ~/ws/aarch64/root/efi/boot/kernel file.

Now we can run QEMU:

qemu-system-aarch64 -machine virt -cpu max \
-drive if=pflash,format=raw,file=~/ws/aarch64/OVMF.fd \
-drive format=raw,file=fat:rw:~/ws/aarch64/root \
-net none \
-serial telnet:localhost:1234,server \
-monitor telnet:localhost:1235,server,nowait \
-nographic


This command will start QEMU using the provide UEFI firmware image (OVMF.fd) and providing content of the ~/ws/aarch64/root as a FAT-formatted storage device to the virtual machine.

QEMU will open two ports for telnet to connect to: 1234 and 1235. The port 1234 will be attached to the serial interface, the PL011 device we care about. The port 1235 will be attach to the QEMU monitor interface (we don’t care much about it).

QEMU will wait until somebody connects to the port 1234 before starting the actual emulation. To see the serial output we’d need to connect to that port. In order to do that, in a different terminal do this:

telnet localhost 1234


If everything goes fine, you should see UEFI Shell in the telnet. Once in the UEFI Shell you can load the program using the boot.efi inside fs0:\efi\boot directory:

Shell> fs0:
FS0:\> cd efi\boot
FS0:\efi\boot\> ls
Directory of: FS0:\efi\boot\
11/28/2020  12:28 <DIR>         8,192  .
11/28/2020  10:53 <DIR>         8,192  ..
11/28/2020  12:28              12,800  boot.efi
11/29/2020  18:27               2,360  kernel
2 File(s)      15,160 bytes
2 Dir(s)
FS0:\efi\boot\> boot.efi
p_type: PT_PHDR, p_flags: PF_R, p_offset: 0x40, p_vaddr: 0x200040, p_paddr: 0x200040, p_filesz: 0xE0, p_memsz: 0xE0, p_align: 0x8