Semantic segmentation of remote-sensing images is usually based on convolutional neural networks (CNNs). CNNs demonstrate powerful local feature extraction through stacked convolution and pooling layers. However, the locality of the convolution operation limits the ability of CNNs to directly capture global information. Relying on the multihead self-attention (MHSA) mechanism, the transformer shows great advantages in modeling global information. In this letter, we propose a CNN-transformer fusion network (CTFNet) for remote-sensing image semantic segmentation. CTFNet applies a U-shaped encoder-...
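The MHSA mechanism named above can be illustrated with a minimal sketch. This is a generic NumPy implementation of multihead self-attention, not CTFNet's specific design; the function names, the single set of projection matrices, and the absence of positional encodings are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Multihead self-attention over a sequence of token embeddings.

    x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model).
    Every token attends to every other token, which is what gives the
    transformer its global receptive field.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project, then split channels into heads: (num_heads, seq_len, d_head).
    def split(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)   # each row is a distribution over tokens
    out = attn @ v                    # (num_heads, seq_len, d_head)

    # Merge heads back into d_model channels and apply the output projection.
    merged = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o
```

For segmentation, an H x W feature map would typically be flattened into a sequence of H*W tokens before applying this operation, then reshaped back.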