From: "Song Bao Hua (Barry Song)" <>
Subject: RE: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA
Date: Fri, 21 Aug 2020 19:14:56 +0000
> -----Original Message-----
> From: Mike Rapoport [mailto:rppt@linux.ibm.com]
> Sent: Saturday, August 22, 2020 2:28 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: hch@lst.de; m.szyprowski@samsung.com; robin.murphy@arm.com;
> will@kernel.org; ganapatrao.kulkarni@cavium.com;
> catalin.marinas@arm.com; akpm@linux-foundation.org;
> iommu@lists.linux-foundation.org; linux-arm-kernel@lists.infradead.org;
> linux-kernel@vger.kernel.org; Zengtao (B) <prime.zeng@hisilicon.com>;
> huangdaode <huangdaode@huawei.com>; Linuxarm <linuxarm@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; Nicolas Saenz Julienne
> <nsaenzjulienne@suse.de>; Steve Capper <steve.capper@arm.com>
> Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve
> per-numa CMA
>
> On Fri, Aug 21, 2020 at 11:33:53PM +1200, Barry Song wrote:
> > Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get
> > coherent DMA buffers to save their command queues and page tables. As
> > there is only one default CMA in the whole system, SMMUs on nodes other
> > than node0 will get remote memory. This leads to significant latency.
> >
> > This patch provides per-numa CMA so that drivers like SMMU can get local
> > memory. Tests show localizing CMA can decrease dma_unmap latency much.
> > For instance, before this patch, SMMU on node2 has to wait for more than
> > 560ns for the completion of CMD_SYNC in an empty command queue; with this
> > patch, it needs 240ns only.
> >
> > A positive side effect of this patch would be improving performance even
> > further for those users who are worried about performance more than DMA
> > security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all
> > drivers can get local coherent DMA buffers.
> >
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
> > Cc: Steve Capper <steve.capper@arm.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Mike Rapoport <rppt@linux.ibm.com>
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > ---
> > -v7: with respect to Will's comments
> >  * move to use for_each_online_node
> >  * add description if users don't specify pernuma_cma
> >  * provide default value for CONFIG_DMA_PERNUMA_CMA
> >
> >  .../admin-guide/kernel-parameters.txt |  11 ++
> >  include/linux/dma-contiguous.h        |   6 ++
> >  kernel/dma/Kconfig                    |  11 ++
> >  kernel/dma/contiguous.c               | 100 ++++++++++++++++--
> >  4 files changed, 118 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..c609527fc35a 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,17 @@
> >  			altogether. For more information, see
> >  			include/linux/dma-contiguous.h
> >
> > +	pernuma_cma=nn[MG]
>
> Maybe cma_pernuma or cma_pernode?
Sounds good.
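
Just to make the driver-side view concrete (a hedged sketch, not part of this
patch; the function name, device and size below are made up): drivers keep
calling dma_alloc_coherent() exactly as before, and with something like
pernuma_cma=16M (or the renamed option) on the command line, a device sitting
on node 2 should get its buffer from the node-2 area rather than the node-0
default:

	/* hypothetical driver snippet, needs <linux/dma-mapping.h>;
	 * assumes pernuma_cma=16M was passed on the kernel command line
	 */
	static int example_alloc_cmdq(struct device *dev, size_t size)
	{
		dma_addr_t dma;
		void *cpu;

		/* unchanged driver code; backing memory should now be node-local */
		cpu = dma_alloc_coherent(dev, size, &dma, GFP_KERNEL);
		if (!cpu)
			return -ENOMEM;

		/* ... use the command queue, then release it on teardown ... */
		dma_free_coherent(dev, size, cpu, dma);
		return 0;
	}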
>
> > +			[ARM64,KNL]
> > +			Sets the size of kernel per-numa memory area for
> > +			contiguous memory allocations. A value of 0 disables
> > +			per-numa CMA altogether. And If this option is not
> > +			specificed, the default value is 0.
> > +			With per-numa CMA enabled, DMA users on node nid will
> > +			first try to allocate buffer from the pernuma area
> > +			which is located in node nid, if the allocation fails,
> > +			they will fallback to the global default memory area.
> > +
> >  	cmo_free_hint=	[PPC] Format: { yes | no }
> >  			Specify whether pages are marked as being inactive
> >  			when they are freed. This is used in CMO environments
> >
> > diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
> > index 03f8e98e3bcc..fe55e004f1f4 100644
> > --- a/include/linux/dma-contiguous.h
> > +++ b/include/linux/dma-contiguous.h
> > @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, struct page *page,
> >
> >  #endif
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +void dma_pernuma_cma_reserve(void);
> > +#else
> > +static inline void dma_pernuma_cma_reserve(void) { }
> > +#endif
> > +
> >  #endif
> >
> >  #endif
> >
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index 847a9d1fa634..c38979d45b13 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -118,6 +118,17 @@ config DMA_CMA
> >  	  If unsure, say "n".
> >
> >  if DMA_CMA
> > +
> > +config DMA_PERNUMA_CMA
> > +	bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
> > +	default NUMA && ARM64
> > +	help
> > +	  Enable this option to get pernuma CMA areas so that devices like
> > +	  ARM64 SMMU can get local memory by DMA coherent APIs.
> > +
> > +	  You can set the size of pernuma CMA by specifying "pernuma_cma=size"
> > +	  on the kernel's command line.
> > +
> >  comment "Default contiguous memory area size:"
> >
> >  config CMA_SIZE_MBYTES
> >
> > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> > index cff7e60968b9..0383c9b86715 100644
> > --- a/kernel/dma/contiguous.c
> > +++ b/kernel/dma/contiguous.c
> > @@ -69,6 +69,19 @@ static int __init early_cma(char *p)
> >  }
> >  early_param("cma", early_cma);
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +
> > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> > +static phys_addr_t pernuma_size_bytes __initdata;
> > +
> > +static int __init early_pernuma_cma(char *p)
> > +{
> > +	pernuma_size_bytes = memparse(p, &p);
> > +	return 0;
> > +}
> > +early_param("pernuma_cma", early_pernuma_cma);
> > +#endif
> > +
> >  #ifdef CONFIG_CMA_SIZE_PERCENTAGE
> >
> >  static phys_addr_t __init __maybe_unused cma_early_percent_memory(void)
> > @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void)
> >
> >  #endif
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +void __init dma_pernuma_cma_reserve(void)
> > +{
> > +	int nid;
> > +
> > +	if (!pernuma_size_bytes)
> > +		return;
> > +
> > +	for_each_online_node(nid) {
> > +		int ret;
> > +		char name[20];
> > +		struct cma **cma = &dma_contiguous_pernuma_area[nid];
> > +
> > +		snprintf(name, sizeof(name), "pernuma%d", nid);
> > +		ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0,
> > +						 0, false, name, cma, nid);
> > +		if (ret) {
> > +			pr_warn("%s: reservation failed: err %d, node %d", __func__,
> > +				ret, nid);
> > +			continue;
> > +		}
> > +
> > +		pr_debug("%s: reserved %llu MiB on node %d\n", __func__,
> > +			 (unsigned long long)pernuma_size_bytes / SZ_1M, nid);
> > +	}
> > +}
> > +#endif
> > +
> >  /**
> >   * dma_contiguous_reserve() - reserve area(s) for contiguous memory handling
> >   * @limit: End address of the reserved memory (optional, 0 for any).
> > @@ -228,23 +269,44 @@ static struct page *cma_alloc_aligned(struct cma *cma, size_t size, gfp_t gfp)
> >   * @size: Requested allocation size.
> >   * @gfp: Allocation flags.
> >   *
> > - * This function allocates contiguous memory buffer for specified device. It
> > - * tries to use device specific contiguous memory area if available, or the
> > - * default global one.
> > + * tries to use device specific contiguous memory area if available, or it
> > + * tries to use per-numa cma, if the allocation fails, it will fallback to
> > + * try default global one.
> >   *
> > - * Note that it byapss one-page size of allocations from the global area as
> > - * the addresses within one page are always contiguous, so there is no need
> > - * to waste CMA pages for that kind; it also helps reduce fragmentations.
> > + * Note that it bypass one-page size of allocations from the per-numa and
> > + * global area as the addresses within one page are always contiguous, so
> > + * there is no need to waste CMA pages for that kind; it also helps reduce
> > + * fragmentations.
> >   */
> >  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> >  {
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +	int nid = dev_to_node(dev);
> > +#endif
> > +
> >  	/* CMA can be used only in the context which permits sleeping */
> >  	if (!gfpflags_allow_blocking(gfp))
> >  		return NULL;
> >  	if (dev->cma_area)
> >  		return cma_alloc_aligned(dev->cma_area, size, gfp);
> > -	if (size <= PAGE_SIZE || !dma_contiguous_default_area)
> > +	if (size <= PAGE_SIZE)
> > +		return NULL;
> > +
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +	if (nid != NUMA_NO_NODE && !(gfp & (GFP_DMA | GFP_DMA32))) {
> > +		struct cma *cma = dma_contiguous_pernuma_area[nid];
>
> It could be that for some node reservation failedm than
> dma_contiguous_pernuma_area[nid] would be NULL.
> I'd add a fallback to another node here.
This has been done. If dma_contiguous_pernuma_area[nid] is NULL, it will fall back to the default global CMA.
>
> > +		struct page *page;
> > +
> > +		if (cma) {
> > +			page = cma_alloc_aligned(cma, size, gfp);
> > +			if (page)
> > +				return page;
> > +		}
> > +	}
> > +#endif
>
> I think the selection of the area can be put in a helper funciton and
> then here we just try to allocate from the selected area. E.g.
>
> static struct cma* dma_get_cma_area(struct device *dev)
> {
> #ifdef CONFIG_DMA_PERNUMA_CMA
> 	int nid = dev_to_node(dev);
> 	struct cma *cma = dma_contiguous_pernuma_area[nid];
>
> 	if (!cma)
> 		/* select cma from another node */ ;
>
> 	return cma;
> #else
> 	return dma_contiguous_default_area;
> #endif
> }
>
It is possible that dma_contiguous_pernuma_area[nid] is not NULL but we still fail to get memory from it, either because it is full or because it cannot satisfy GFP_DMA(32). In that case we still need to fall back to the default global CMA. So the code tries the per-numa CMA first and then the default global CMA; it is not picking one of the two areas, it is trying both.
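
To make the ordering concrete, the allocation path in this patch boils down
to the following (a shortened restatement of the hunk above, for discussion
only, not new code):

	/*
	 * 1. device-specific area, 2. per-numa area of dev's node,
	 * 3. global default area. A miss at step 2 (area NULL, full,
	 * or skipped because of GFP_DMA/GFP_DMA32) falls through to
	 * step 3 instead of returning NULL.
	 */
	if (dev->cma_area)
		return cma_alloc_aligned(dev->cma_area, size, gfp);

	if (size <= PAGE_SIZE)
		return NULL;

#ifdef CONFIG_DMA_PERNUMA_CMA
	if (nid != NUMA_NO_NODE && !(gfp & (GFP_DMA | GFP_DMA32))) {
		struct cma *cma = dma_contiguous_pernuma_area[nid];

		if (cma) {
			page = cma_alloc_aligned(cma, size, gfp);
			if (page)
				return page;
		}
	}
#endif

	if (!dma_contiguous_default_area)
		return NULL;

	return cma_alloc_aligned(dma_contiguous_default_area, size, gfp);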
> struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> {
> 	struct cma *cma;
> 	...
>
> 	cma = dma_get_cma_area(dev);
> 	if (!cma)
> 		return NULL;
>
> 	return cma_alloc_aligned(cma, size, gfp);
> }
>
> > +	if (!dma_contiguous_default_area)
> >  		return NULL;
> > +
> >  	return cma_alloc_aligned(dma_contiguous_default_area, size, gfp);
> >  }
> >
> > @@ -261,9 +323,27 @@ struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> >   */
> >  void dma_free_contiguous(struct device *dev, struct page *page, size_t size)
> >  {
> > -	if (!cma_release(dev_get_cma_area(dev), page,
> > -			PAGE_ALIGN(size) >> PAGE_SHIFT))
> > -		__free_pages(page, get_order(size));
>
> Here as well, dev_get_cma_area() can be replaced with, say
> dma_get_dev_cma_area(dev, page) that will hide the below logic.
As explained above, this won't work. The page may have come from either the per-numa area or the default global one after a fallback, so the free path has to try both rather than selecting a single area up front.
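
The reason trying each candidate area in turn is safe (including a per-numa
slot that is NULL because its reservation failed) is that cma_release() just
reports false when the page does not belong to the area it is handed. Roughly,
the guard at the top of mm/cma.c looks like this (paraphrased from memory, not
copied from this series):

	bool cma_release(struct cma *cma, const struct page *pages,
			 unsigned int count)
	{
		unsigned long pfn;

		/* not this area's page: NULL area or pfn outside its range */
		if (!cma || !pages)
			return false;

		pfn = page_to_pfn(pages);
		if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
			return false;

		/* ... otherwise clear the bitmap and free the range ... */
		return true;
	}

So the hunk below can offer the page to the per-numa area of page_to_nid(page)
first and to the default area second, and only falls back to __free_pages()
when neither claims it.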
>
> > +	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
> > +
> > +	/* if dev has its own cma, free page from there */
> > +	if (dev->cma_area) {
> > +		if (cma_release(dev->cma_area, page, count))
> > +			return;
> > +	} else {
> > +		/*
> > +		 * otherwise, page is from either per-numa cma or default cma
> > +		 */
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +		if (cma_release(dma_contiguous_pernuma_area[page_to_nid(page)],
> > +				page, count))
> > +			return;
> > +#endif
> > +		if (cma_release(dma_contiguous_default_area, page, count))
> > +			return;
> > +	}
> > +
> > +	/* not in any cma, free from buddy */
> > +	__free_pages(page, get_order(size));
> >  }
> >
> >  /*
> > --
> > 2.27.0
> >
> >
Thanks Barry