ARM’s PL011 UART
by Mike Krinkin
In the previous post we managed to try our simplistic EFI loader on 64-bit ARM in QEMU and on HiKey960 board. Now I want to try to explore the aarch64 architecture a bit further and maybe create something interesting worth loading on the board.
Since the previous post I’ve made a few changes to the EFI loader; you can find them on GitHub. The sources for this post are also available on GitHub, but in a different repository.
UART
I think that the first step we need to take when exploring a new device is to figure out how to communicate with it. Depending on the specific board you have, different options may be available; I decided to go with the serial port.
I will work with the virt board emulated by QEMU and when possible I will check my code on the HiKey960 board that I have. Both of them as far as I can tell have PL011 UART.
PL011 UART was either designed or licensed by ARM and seems to be quite common. The PL011 specification can be freely downloaded from the ARM site.
What is UART? UART (universal asynchronous receiver-transmitter) commonly refers to a serial communication protocol and the hardware that implements it. I personally don’t really understand what UART actually is, as there doesn’t seem to be a specification defining it.
NOTE: normally I would be very much concerned about it, since for any means of communication it’s important that the communicating parties have compatible implementations on the physical and logical levels. It’s hard to achieve this compatibility if there is no common specification. However, all the UART controllers I’ve had so far seem to be able to communicate just fine.
However, there are some general things that I know. For example, UART is serial, which means that data is transferred bit by bit. UART also appears to be full duplex in general, which means that data may be transferred in both directions at the same time.
More importantly, however, UART might support different data frame formats and may operate at different speeds. For successful communication the receiver and the transmitter should use the same data format and operate at a similar speed.
A UART data frame seems to transfer one character at a time. The size of the character is configurable to some extent on modern hardware. For example, PL011 supports 5, 6, 7 and 8 bit long characters (see Chapter 3 Programmers Model, section 3.3.7 Line Control Register, UARTLCR_H in the PL011 specification).
Together with the data it may be possible to include a parity bit that could be used to check for data corruption on the channel. There are multiple configurations of parity checks available (even parity, odd parity and even something called sticky parity).
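To make the parity idea a bit more concrete, here is a minimal sketch of how an even or odd parity bit could be computed for one 8-bit character. The function names are made up for this illustration and do not come from the PL011 specification; on real hardware the UART computes and checks this bit by itself.

```c
#include <stdint.h>

// Even parity: the parity bit is chosen so that the total number of 1 bits
// (data bits plus the parity bit) is even.
// Hypothetical helper, for illustration only.
uint32_t even_parity_bit(uint8_t data)
{
    uint32_t ones = 0;

    for (int i = 0; i < 8; ++i)
        ones += (data >> i) & 1u;
    return ones & 1u;
}

// Odd parity is simply the inverse of even parity.
uint32_t odd_parity_bit(uint8_t data)
{
    return even_parity_bit(data) ^ 1u;
}
```

For example, the character 'A' (0x41) has two bits set, so its even parity bit is 0 and its odd parity bit is 1.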
After the data and parity bits comes the stop signal. Interestingly enough there are several different configurations available for the stop signal: 1 stop bit, 2 stop bits and 1.5 stop bits (whatever that means).
In addition to the data, parity and stop bits, a UART frame contains a start bit that indicates to the receiver that a data transmission is happening. This part doesn’t seem to be configurable, so we are not particularly interested in it.
From personal experience, the commonly used configuration of the data frame consists of 8 data bits, no parity bit and 1 stop bit.
NOTE: when working with QEMU it doesn’t really matter what the UART frame looks like since there is no actual physical communication involved. With real hardware it might matter, but it appears to me that the configurations used in various hobby projects more or less converge.
The operating speed of UART is commonly referred to as baudrate. Baudrate is also known as symbol rate. In practice, for a channel that can transfer only two kinds of symbols (0 and 1, low and high, etc) the baudrate is just the number of bits per second.
When configuring my UART I will be using the following parameters:
- data frame with 8 data bits, no parity bit and 1 stop bit
- baudrate of 115200.
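As a rough sanity check of what these parameters mean in practice, the sketch below counts the bits that go on the wire for one character and derives an upper bound on throughput. The helper names are my own, and the calculation assumes back-to-back frames with no idle time between them.

```c
#include <stdint.h>

// Total number of bits on the wire for one character:
// 1 start bit + data bits + optional parity bit + stop bits.
uint32_t frame_bits(uint32_t data_bits, uint32_t parity_bits, uint32_t stop_bits)
{
    return 1 + data_bits + parity_bits + stop_bits;
}

// Upper bound on characters per second for a given baudrate, assuming frames
// follow each other without any gaps.
uint32_t max_chars_per_second(uint32_t baudrate, uint32_t bits_per_frame)
{
    return baudrate / bits_per_frame;
}
```

With 8 data bits, no parity and 1 stop bit a frame is 10 bits long, so at a baudrate of 115200 the link can carry at most 11520 characters per second.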
PL011
PL011 is just one of the hardware implementations of UART. As with many others, it’s a bit complicated because there are multiple parameters to configure.
As was discussed in the previous section we need to at least configure the data frame format and the speed. However with hardware devices the story doesn’t end there.
For example, PL011 has receive/transmit buffers (FIFOs), so you can queue multiple characters to be transmitted and then let the device send them out while the CPU does other stuff. If you want to use this functionality it should be configured.
It’s also possible to configure what notifications you want to get from the device and when. For example, we may ask PL011 to generate an interrupt when it’s done transmitting all the characters in the FIFO. Again, that should be configured.
NOTE: PL011 specification is actually quite compact by specification standards, but, as with many other specifications, getting a complete picture just reading through it is rather hard.
I just need the simplest way to communicate with the device for some debug printing. So the simplest configuration that achieves the goal is good for me. With that in mind here are my goals:
- one directional communication: we only need to send data from the device,
- no interrupts, use polling to interact with the device.
So what I need to find out is how to configure the data frame format I want and the speed of the device. In addition to that I need to find out how to check that character transmission is complete and we can send the next character. As for the rest, I will take disabling everything I don’t understand/don’t need as a guiding principle when reading through the documentation.
PL011 specification describes a few registers, here is the list of registers that I found relevant for the configuration:
- UARTDR - writing to this register initiates the actual data transmission
- UARTFR - reading this register we can check if transmission is complete
- UARTIBRD and UARTFBRD are responsible for the speed
- UARTLCR_H controls the data frame format.
Besides the registers above I also noted the following registers that might require configuring:
- UARTCR - we can enable/disable UART, transmission or reception via this register, so we might want to program it to make sure that UART is enabled
- UARTIMSC - allows masking interrupts, so we might want to program it to make sure that all the interrupts are disabled
- UARTDMACR - controls DMA, since I don’t want to figure out how DMA is involved here I will program it to make sure that DMA is disabled.
Programming PL011 registers
The specification seems to imply that all the PL011 registers are memory mapped. That means that we just need to read/write the right memory address to access the registers.
In the chapter 3 Programmers Model, section 3.1 About the programmers model, PL011 specification says that the base address of the memory where the registers are mapped is not fixed, but the offset of the registers from the base address is fixed.
For the registers outlined above the offsets are:
Register Name | Offset |
---|---|
UARTDR | 0x000 |
UARTFR | 0x018 |
UARTIBRD | 0x024 |
UARTFBRD | 0x028 |
UARTLCR_H | 0x02c |
UARTCR | 0x030 |
UARTIMSC | 0x038 |
UARTDMACR | 0x048 |
We can go ahead and put that in the code:
#include <stdint.h> // needed for uint32_t type
static const uint32_t DR_OFFSET = 0x000;
static const uint32_t FR_OFFSET = 0x018;
static const uint32_t IBRD_OFFSET = 0x024;
static const uint32_t FBRD_OFFSET = 0x028;
static const uint32_t LCR_OFFSET = 0x02c;
static const uint32_t CR_OFFSET = 0x030;
static const uint32_t IMSC_OFFSET = 0x038;
static const uint32_t DMACR_OFFSET = 0x048;
We don’t know the actual addresses in memory for those registers since the specification doesn’t provide it, so the base address will become a parameter that we will figure out later. For now, let’s just introduce a structure that will hold the base address:
#include <stdint.h> // needed for uint64_t type
struct pl011 {
    uint64_t base_address;
};
NOTE: we will add more fields into the structure later.
Let’s assume that we know the base address and the offset, how do we actually access the registers?
Here I was quite confused by the specification. When the specification describes the size of the registers, it specifies the number of bits. The confusing part is that the number of bits is not always a multiple of 8. For example, the UARTFR register is 9 bits long. How can we access a register that is 9 bits long on a platform that only allows 8 bit accesses minimum?
Turns out that all the registers are actually 32 bits long and the bits we need are the lower bits of those 32. We just have to be careful to avoid writing anything but 0s to the unused bits.
So here is the code that I created to access the registers:
#include <stdint.h> // needed for uint32_t type
volatile uint32_t *reg(const struct pl011 *dev, uint32_t offset)
{
    const uint64_t addr = dev->base_address + offset;
    return (volatile uint32_t *)((void *)addr);
}
Using this function we can read UARTCR this way:
// dev here is a pointer to struct pl011
uint32_t cr = *reg(dev, CR_OFFSET);
NOTE: the volatile part here is to make accesses to memory a part of the observable behavior of the program, so that the compiler cannot optimize them away because it thinks that the result is not used or that the result might be cached in a register.
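Since the base address is just a parameter, the accessor can also be exercised without any hardware by pointing it at ordinary memory. The sketch below is only an experiment harness under that assumption: fake_registers, fake_pl011 and write_then_peek are hypothetical names, and the struct and reg are repeated so that the snippet stands on its own.

```c
#include <stdint.h>

struct pl011 {
    uint64_t base_address;
};

volatile uint32_t *reg(const struct pl011 *dev, uint32_t offset)
{
    const uint64_t addr = dev->base_address + offset;

    return (volatile uint32_t *)(uintptr_t)addr;
}

// Hypothetical stand-in for the device: 0x1000 bytes of ordinary memory.
static uint32_t fake_registers[0x1000 / sizeof(uint32_t)];

// Build a pl011 structure whose base address points at the fake registers.
struct pl011 fake_pl011(void)
{
    struct pl011 dev = { .base_address = (uint64_t)(uintptr_t)fake_registers };

    return dev;
}

// Write a value through reg() and read it back from the backing array.
// The offset must be 4-byte aligned and smaller than 0x1000.
uint32_t write_then_peek(uint32_t offset, uint32_t value)
{
    const struct pl011 dev = fake_pl011();

    *reg(&dev, offset) = value;
    return fake_registers[offset / sizeof(uint32_t)];
}
```

A real device obviously reacts to writes in ways that plain memory does not, so this only checks the address arithmetic and not the device behavior.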
Calculating baudrate divisors
PL011 is a bit weird because it has two registers that are used to control the speed of UART: UARTIBRD and UARTFBRD. Both registers together store a value that is used to divide the base clock frequency. UARTIBRD stores the integer part and UARTFBRD stores the fractional part.
The actual implementation is, of course, up to the hardware, but one easy way to think about it is this. Let’s assume that we have a clock that generates a pulse once a second.
We can send the clock through a circuit that counts the pulses, lets through every Nth pulse and hides all the others. This way we’d get a clock that generates a pulse once every N seconds. So we divided the frequency of the original clock by N. Basically UARTIBRD and UARTFBRD together store that N.
UARTIBRD is limited to 16 bits and UARTFBRD is 6 bits. The fractional part apparently uses the regular binary encoding; in other words, the fractional part stores the number of \(1/2^6 = 1/64\) parts.
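One way to picture this encoding is as a single fixed-point number with 6 fractional bits. The sketch below packs and unpacks such a value; the function names are hypothetical and only illustrate the layout described above.

```c
#include <stdint.h>

// Pack the integer and fractional parts into one fixed-point value that is
// equal to 64 times the divisor.
uint32_t pack_divisor(uint32_t integer, uint32_t fractional)
{
    return ((integer & 0xffff) << 6) | (fractional & 0x3f);
}

// Integer part of a packed divisor (the value that would go into UARTIBRD).
uint32_t divisor_integer(uint32_t packed)
{
    return (packed >> 6) & 0xffff;
}

// Fractional part of a packed divisor in units of 1/64 (UARTFBRD).
uint32_t divisor_fractional(uint32_t packed)
{
    return packed & 0x3f;
}
```

For example, a divisor of 1.5 is stored as an integer part of 1 and a fractional part of 32, since 32/64 = 0.5.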
The [PL011] specification also provides an equation to calculate the desired divisor value:
\[D = {F_{UARTCLK} \over 16 \times B}\]
where \(D\) is the calculated divisor value, \(B\) is the desired baudrate and \(F_{UARTCLK}\) is the base clock frequency.
The desired baudrate is 115200 as was mentioned above, but what is the base clock frequency?
The specification doesn’t actually tell us the base clock frequency. Supposedly, as with the base address, it might change from hardware to hardware. For now let’s add it to our structure as a parameter together with the desired baudrate:
#include <stdint.h> // needed for uint64_t and uint32_t types
struct pl011 {
    uint64_t base_address;
    uint64_t base_clock;
    uint32_t baudrate;
};
To actually calculate the values for UARTIBRD and UARTFBRD let’s notice that we don’t need to work with the integer and fractional parts separately. We can calculate the value of \(64 \times D\) instead. Then the lower 6 bits of the result will be our fractional part and the higher 16 bits will be our integer part:
#include <stdint.h> // needed for uint32_t type
static void calculate_divisiors(
    const struct pl011 *dev, uint32_t *integer, uint32_t *fractional)
{
    // 64 * F_UARTCLK / (16 * B) = 4 * F_UARTCLK / B
    const uint32_t div = 4 * dev->base_clock / dev->baudrate;
    *fractional = div & 0x3f;
    *integer = (div >> 6) & 0xffff;
}
NOTE: the calculations above aren’t absolutely accurate and allow some rounding errors. You can multiply the \(F_{UARTCLK}\) by a higher number to calculate the baudrate more accurately and drop the bits you don’t need. However I don’t strive here for perfect accuracy, so this should be good enough.
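To illustrate the rounding remark, here is a sketch of one possible variant that rounds \(64 \times D\) to the nearest integer instead of truncating it, together with a helper that computes the baudrate a given divisor actually produces. Both function names are hypothetical, and the 24 MHz and 48 MHz clocks used in the example are just sample inputs.

```c
#include <stdint.h>

// 64 * D rounded to the nearest integer instead of truncated: first compute
// 128 * D = 8 * F_UARTCLK / B, then add 1 and halve.
uint32_t rounded_divisor_x64(uint64_t base_clock, uint32_t baudrate)
{
    const uint64_t div_x128 = 8 * base_clock / baudrate;

    return (uint32_t)((div_x128 + 1) / 2);
}

// Baudrate produced by a given 64 * D value, truncated to an integer:
// B = F_UARTCLK / (16 * D) = 4 * F_UARTCLK / (64 * D).
// The divisor must not be zero.
uint32_t effective_baudrate(uint64_t base_clock, uint32_t div_x64)
{
    return (uint32_t)(4 * base_clock / div_x64);
}
```

For a 24 MHz clock and a baudrate of 115200 both truncation and rounding give 833 (integer part 13, fractional part 1), which corresponds to an effective baudrate of about 115246. For a 48 MHz clock truncation gives 1666 and rounding gives 1667, with effective baudrates of about 115246 and 115176 respectively, so rounding lands slightly closer to the target.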
Waiting for transmission to complete
When writing to the PL011 we need to make sure that all the previous characters were successfully sent out. So we need to find out how to wait for the transmission to complete.
Also, as I outlined above, I don’t want to use interrupts (for now), so we should be able to do that by just polling the device.
In order to do that we can read from the UARTFR register. Specifically, according to the specification, the BUSY bit of UARTFR will be set to 1 if UART is currently busy transmitting data.
So we can wait for the transmission to complete this way:
#include <stdint.h> // needed for uint32_t type
static const uint32_t FR_BUSY = (1 << 3);
static void wait_tx_complete(const struct pl011 *dev)
{
    // Spin while the BUSY bit is set
    while ((*reg(dev, FR_OFFSET) & FR_BUSY) != 0) {}
}
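An unbounded loop like this will hang forever if the device never becomes idle, for example because the base address is wrong. As a hedged alternative, here is a sketch of a bounded variant that gives up after a fixed number of polls. The names wait_tx_complete_bounded and poll_fake_flags are hypothetical, and the register is passed as a pointer so that the logic can be tried against a plain variable.

```c
#include <stdint.h>

static const uint32_t FR_BUSY = (1 << 3);

// Poll a flag register until BUSY clears or max_polls reads have been made.
// Returns 0 if the UART went idle and -1 on timeout.
int wait_tx_complete_bounded(volatile const uint32_t *fr, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; ++i) {
        if ((*fr & FR_BUSY) == 0)
            return 0;
    }
    return -1;
}

// Helper for experiments: polls a fake register that always holds `flags`.
int poll_fake_flags(uint32_t flags, uint32_t max_polls)
{
    volatile uint32_t fake_fr = flags;

    return wait_tx_complete_bounded(&fake_fr, max_polls);
}
```

What a good poll limit would be depends on the CPU and UART speeds, so I treat it as a tunable parameter here rather than a known value.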
Configuring PL011
Now let’s try to put everything together. I’ll start by adding a few more parameters to our structure that define the data format to avoid spreading magic numbers through the code:
#include <stdint.h> // needed for uint64_t and uint32_t types
struct pl011 {
    uint64_t base_address;
    uint64_t base_clock;
    uint32_t baudrate;
    uint32_t data_bits;
    uint32_t stop_bits;
};
I will also create a function to initialize the structure and the device:
int pl011_setup(struct pl011 *dev, uint64_t base_address, uint64_t base_clock)
{
    dev->base_address = base_address;
    dev->base_clock = base_clock;
    dev->baudrate = 115200;
    dev->data_bits = 8;
    dev->stop_bits = 1;
    return pl011_reset(dev);
}
The function takes the pointer to the structure describing the device and two parameters that have to be provided from the outside: base address and base clock frequency. The rest of the parameters are initialized to fixed values.
The pl011_reset function will do all the heavy lifting and will actually configure the device by writing to the registers. However, before we take a look at pl011_reset, please take note of the notes in section 3.3.8 Control Register, UARTCR of the PL011 specification. One of the notes outlines the proper way of programming the UARTCR register:
1. We first need to disable UART (see UARTEN bit of UARTCR)
2. Then we need to wait for transmission of the current character to complete.
3. Flush the FIFO queues (see FEN bit of UARTLCR_H)
4. Write the desired values to the UARTCR register
5. Enable UART (again, see UARTEN bit of UARTCR).
NOTE: enabling and disabling UART both actually involve writing UARTCR, so it’s not immediately clear if steps 4 and 5 need to be two separate steps. I don’t want to spend too much time figuring that out, so I’m going to assume for now that writing to the UARTEN bit of the UARTCR register should be separated from changing the values of all other bits.
Keeping this procedure in mind and remembering that we also want to program a few other registers (UARTLCR_H, UARTIBRD, UARTFBRD, UARTIMSC and UARTDMACR), this is how pl011_reset might look:
#include <stdint.h> // needed for uint32_t type
static const uint32_t CR_TXEN = (1 << 8);
static const uint32_t CR_UARTEN = (1 << 0);
static const uint32_t LCR_FEN = (1 << 4);
static const uint32_t LCR_STP2 = (1 << 3);
int pl011_reset(const struct pl011 *dev)
{
    uint32_t cr = *reg(dev, CR_OFFSET);
    uint32_t lcr = *reg(dev, LCR_OFFSET);
    uint32_t ibrd, fbrd;

    // Disable UART before anything else by clearing the UARTEN bit
    *reg(dev, CR_OFFSET) = (cr & ~CR_UARTEN);
    // Wait for any ongoing transmissions to complete
    wait_tx_complete(dev);
    // Flush FIFOs
    *reg(dev, LCR_OFFSET) = (lcr & ~LCR_FEN);

    // Set frequency divisors (UARTIBRD and UARTFBRD) to configure the speed
    calculate_divisiors(dev, &ibrd, &fbrd);
    *reg(dev, IBRD_OFFSET) = ibrd;
    *reg(dev, FBRD_OFFSET) = fbrd;

    // Configure data frame format according to the parameters (UARTLCR_H).
    // We don't actually use all the possibilities, so this part of the code
    // can be simplified.
    lcr = 0x0;
    // WLEN part of UARTLCR_H, you can check that this calculation does the
    // right thing for yourself
    lcr |= ((dev->data_bits - 1) & 0x3) << 5;
    // Configure the number of stop bits
    if (dev->stop_bits == 2)
        lcr |= LCR_STP2;
    // Actually write the frame format; this has to come after the divisor
    // registers are written
    *reg(dev, LCR_OFFSET) = lcr;

    // Disable all interrupts: in UARTIMSC a bit set to 1 lets the
    // corresponding interrupt through, so all bits are cleared
    *reg(dev, IMSC_OFFSET) = 0x0;
    // Disable DMA by setting all bits to 0
    *reg(dev, DMACR_OFFSET) = 0x0;

    // I only need transmission, so that's the only thing I enabled.
    *reg(dev, CR_OFFSET) = CR_TXEN;
    // Finally enable UART
    *reg(dev, CR_OFFSET) = CR_TXEN | CR_UARTEN;
    return 0;
}
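The WLEN calculation can be checked in isolation. The sketch below extracts the frame format part into a standalone function, lcr_h_value, which is a name I made up for this check; it assumes no parity and disabled FIFOs, as in the configuration used in this post.

```c
#include <stdint.h>

static const uint32_t LCR_STP2 = (1 << 3);

// Compute the UARTLCR_H value for a frame without parity and with FIFOs
// disabled. WLEN occupies bits 6:5 and encodes 5, 6, 7 and 8 data bits as
// 0b00, 0b01, 0b10 and 0b11 respectively.
uint32_t lcr_h_value(uint32_t data_bits, uint32_t stop_bits)
{
    uint32_t lcr = ((data_bits - 1) & 0x3) << 5;

    if (stop_bits == 2)
        lcr |= LCR_STP2;
    return lcr;
}
```

For the 8 data bits and 1 stop bit used here the result is 0x60; with 2 stop bits it would be 0x68.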
Sending data
The final piece of the puzzle is how we can actually send the data. Sending the data is easy: we can just write to the UARTDR register and wait until the data is transmitted:
#include <stddef.h> // needed for size_t type
int pl011_send(const struct pl011 *dev, const char *data, size_t size)
{
    // make sure that there is no outstanding transfer just in case
    wait_tx_complete(dev);
    for (size_t i = 0; i < size; ++i) {
        if (data[i] == '\n') {
            *reg(dev, DR_OFFSET) = '\r';
            wait_tx_complete(dev);
        }
        *reg(dev, DR_OFFSET) = data[i];
        wait_tx_complete(dev);
    }
    return 0;
}
NOTE: in the Unix world \n is normally treated as \r\n by text editors and virtual terminals, but it appears that it’s not always the case. So I’m adding \r here before \n just in case.
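The newline handling can be tried without hardware by replacing the register write with a callback. The sketch below is such an experiment: put_char_fn, send_translated, capture_put and translated_length are hypothetical names that are not part of the driver in this post.

```c
#include <stddef.h>

// Hypothetical output hook: called once per character to be transmitted.
typedef void (*put_char_fn)(void *ctx, char c);

// Same newline handling as in pl011_send, but with the register write
// replaced by a callback so the logic can run without a device.
void send_translated(put_char_fn put, void *ctx, const char *data, size_t size)
{
    for (size_t i = 0; i < size; ++i) {
        if (data[i] == '\n')
            put(ctx, '\r');
        put(ctx, data[i]);
    }
}

// Small capture buffer used to observe what would have been sent.
struct capture {
    char bytes[64];
    size_t used;
};

// Stores the character in the capture buffer, dropping it if the buffer is full.
void capture_put(void *ctx, char c)
{
    struct capture *cap = ctx;

    if (cap->used < sizeof(cap->bytes))
        cap->bytes[cap->used++] = c;
}

// Number of characters that would be transmitted for the given input
// (capped at the 64 characters the capture buffer can hold).
size_t translated_length(const char *data, size_t size)
{
    struct capture cap = { .used = 0 };

    send_translated(capture_put, &cap, data, size);
    return cap.used;
}
```

For example, the 3 characters of "Hi\n" turn into 4 transmitted characters because of the extra \r.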
Testing in QEMU
Before we can test anything in QEMU we need to fill a couple of gaps. There are two parameters that we don’t know yet: the base address and the base clock frequency. Those parameters may change from platform to platform. So we need to figure out what the values of those parameters are in QEMU.
Fortunately QEMU allows us to dump the Device Tree describing the platform it emulates.
NOTE: if you are not familiar with Device Tree at all you can refer to the chapter 1 Introduction, section Purpose and Scope of the Device Tree specification that explains what problem this specification is trying to solve (it’s available on GitHub). For this post complete understanding of the Device Tree is not needed, so I’m not covering it.
To get the binary dump of the device tree you can run the following command:
qemu-system-aarch64 -machine virt,dumpdtb=virt.dtb -cpu max
It will produce the dump in the file virt.dtb in the current directory. While we’re at it, virt in this case is the name of the platform that QEMU emulates. As you may have guessed from the name, it’s not a real hardware platform.
The binary dump of the device tree is not human readable, so we have to convert it into a human readable format. We can use the Device Tree compiler to do that:
sudo apt-get install device-tree-compiler
dtc -I dtb -O dts virt.dtb
This command will produce the Device Tree in a human readable format as output. Now, you may not be familiar with the format even in a human readable form, so let me post here the relevant parts and what I derived from them:
...
pl011@9000000 {
    clock-names = "uartclk\0apb_pclk";
    clocks = <0x8000 0x8000>;
    interrupts = <0x00 0x01 0x04>;
    reg = <0x00 0x9000000 0x00 0x1000>;
    compatible = "arm,pl011\0arm,primecell";
};
...
apb-pclk {
    phandle = <0x8000>;
    clock-output-names = "clk24mhz";
    clock-frequency = <0x16e3600>;
    #clock-cells = <0x00>;
    compatible = "fixed-clock";
};
...
Those two parts are responsible for the PL011 controller and the clock used by that controller. I’m sure you can already spot the base clock frequency from that snippet - it’s 24MHz.
The base address for the PL011 registers is 0x9000000 and this information comes from the reg property of the pl011@9000000 node.
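As a small illustration of how the numbers in the dump map to the parameters we need, the sketch below combines pairs of 32-bit cells into 64-bit values. It assumes that the reg property uses two cells for the address and two cells for the size, which matches the four cells in the dump above; combine_cells is a hypothetical helper name.

```c
#include <stdint.h>

// Combine two 32-bit Device Tree cells (high cell first) into a 64-bit value.
// Assumption: addresses and sizes in the reg property take two cells each.
uint64_t combine_cells(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}
```

Applied to reg = <0x00 0x9000000 0x00 0x1000> this gives a base address of 0x9000000 and a region size of 0x1000 bytes. The clock-frequency value 0x16e3600 is simply 24000000 in hexadecimal.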
NOTE: normally the OS kernel should read the Device Tree directly instead of us dumping the Device Tree and hardcoding the parameters we found there in the code, but that would require us to understand the Device Tree in greater detail than we do now, so I’m going for a quick and hacky solution for now.
With that information available we can write our entry point:
#include "pl011.h" // where PL011 related declarations live
void main(void)
{
    struct pl011 serial;

    pl011_setup(
        &serial, /* base_address = */0x9000000, /* base_clock = */24000000);
    // sizeof counts the terminating NUL too, so subtract 1 to skip it
    pl011_send(&serial, "Hello, World\n", sizeof("Hello, World\n") - 1);
    // There is nowhere to exit, so just hang here
    while (1) {}
}
Now we can build everything. In my case the project includes just 3 files:
- pl011.h - declarations for the functions interacting with PL011
- pl011.c - implementations of the functions interacting with PL011
- main.c - where main lives.
We need to build those in a freestanding environment, for the aarch64 architecture, and specify that the main function is our entry point.
I’m using an LLVM based toolchain (clang as the compiler and lld as the linker), so to cross compile all I need to do is to specify the target to the compiler. I’m using aarch64-unknown-eabi as my target (Embedded ABI, no specific vendor).
So here is how my Makefile looks:
CC := clang
LD := lld
CFLAGS := \
	-ffreestanding -MMD -mno-red-zone -std=c11 \
	-target aarch64-unknown-eabi -Wall -Werror -pedantic
LDFLAGS := \
	-flavor ld -e main
SRCS := pl011.c main.c
default: all
all: kernel.elf
%.o: %.c
	$(CC) $(CFLAGS) -c $< -o $@
kernel.elf: pl011.o main.o
	$(LD) $(LDFLAGS) $^ -o $@
-include $(SRCS:.c=.d)
.PHONY: clean all default
If everything builds correctly, after running make you should get a kernel.elf file suitable to be loaded by the EFI loader from github.com/krinkinmu/efi.
If you build the loader from github.com/krinkinmu/efi you should have a boot.efi file, which is a UEFI based loader for ELF files.
Finally, if you don’t have it already, you need UEFI firmware for QEMU itself:
sudo apt-get install qemu-efi-aarch64
Now that we have everything, let’s set up our working directory:
mkdir -p ~/ws/aarch64
cd ~/ws/aarch64
cp /usr/share/qemu-efi-aarch64/QEMU_EFI.fd OVMF.fd
truncate -s 64m OVMF.fd
mkdir -p root/efi/boot
NOTE: ~/ws/aarch64 will be my working directory through this cycle of posts.
NOTE: ~/ws/aarch64/root will store the content of the filesystem that will be available to the virtual machine.
Copy boot.efi to ~/ws/aarch64/root/efi/boot/boot.efi and kernel.elf to ~/ws/aarch64/root/efi/boot/kernel.
Now we can run QEMU:
qemu-system-aarch64 -machine virt -cpu max \
-drive if=pflash,format=raw,file=~/ws/aarch64/OVMF.fd \
-drive format=raw,file=fat:rw:~/ws/aarch64/root \
-net none \
-serial telnet:localhost:1234,server \
-monitor telnet:localhost:1235,server,nowait \
-nographic
This command will start QEMU using the provided UEFI firmware image (OVMF.fd) and providing the content of ~/ws/aarch64/root as a FAT-formatted storage device to the virtual machine.
QEMU will open two ports for telnet to connect to: 1234 and 1235. Port 1234 will be attached to the serial interface, the PL011 device we care about. Port 1235 will be attached to the QEMU monitor interface (we don’t care much about it).
QEMU will wait until somebody connects to the port 1234 before starting the actual emulation. To see the serial output we’d need to connect to that port. In order to do that, in a different terminal do this:
telnet localhost 1234
If everything goes fine, you should see the UEFI Shell in the telnet session. Once in the UEFI Shell you can load the program using boot.efi inside the fs0:\efi\boot directory:
Shell> fs0:
FS0:\> cd efi\boot
FS0:\efi\boot\> ls
Directory of: FS0:\efi\boot\
11/28/2020 12:28 <DIR> 8,192 .
11/28/2020 10:53 <DIR> 8,192 ..
11/28/2020 12:28 12,800 boot.efi
11/29/2020 18:27 2,360 kernel
2 File(s) 15,160 bytes
2 Dir(s)
FS0:\efi\boot\> boot.efi
Program headers:
p_type: PT_PHDR, p_flags: PF_R, p_offset: 0x40, p_vaddr: 0x200040, p_paddr: 0x200040, p_filesz: 0xE0, p_memsz: 0xE0, p_align: 0x8
p_type: PT_LOAD, p_flags: PF_R, p_offset: 0x0, p_vaddr: 0x200000, p_paddr: 0x200000, p_filesz: 0x12E, p_memsz: 0x12E, p_align: 0x10000
p_type: PT_LOAD, p_flags: PF_X | PF_R, p_offset: 0x130, p_vaddr: 0x210130, p_paddr: 0x210130, p_filesz: 0x404, p_memsz: 0x404, p_align: 0x10000
p_type: 0x6474E551, p_flags: PF_W | PF_R, p_offset: 0x0, p_vaddr: 0x0, p_paddr: 0x0, p_filesz: 0x0, p_memsz: 0x0, p_align: 0x0
Loading ELF image...
Loaded ELF image
Starting ELF image...
Hello, World
Instead of conclusion
Here you go, the contact has been established. I didn’t try this on my HiKey960 yet, but I will definitely try to do that in the future and create a small follow-up describing how it went.
tags: clang - aarch64 - qemu - uart - pl011