From: "Song Bao Hua (Barry Song)" <>
Subject: RE: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA
Date: Fri, 21 Aug 2020 19:14:56 +0000
> -----Original Message-----
> From: Mike Rapoport [mailto:rppt@linux.ibm.com]
> Sent: Saturday, August 22, 2020 2:28 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: hch@lst.de; m.szyprowski@samsung.com; robin.murphy@arm.com;
> will@kernel.org; ganapatrao.kulkarni@cavium.com;
> catalin.marinas@arm.com; akpm@linux-foundation.org;
> iommu@lists.linux-foundation.org; linux-arm-kernel@lists.infradead.org;
> linux-kernel@vger.kernel.org; Zengtao (B) <prime.zeng@hisilicon.com>;
> huangdaode <huangdaode@huawei.com>; Linuxarm <linuxarm@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; Nicolas Saenz Julienne
> <nsaenzjulienne@suse.de>; Steve Capper <steve.capper@arm.com>
> Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve
> per-numa CMA
>
> On Fri, Aug 21, 2020 at 11:33:53PM +1200, Barry Song wrote:
> > Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get
> > coherent DMA buffers to save their command queues and page tables. As
> > there is only one default CMA in the whole system, SMMUs on nodes other
> > than node0 will get remote memory. This leads to significant latency.
> >
> > This patch provides per-numa CMA so that drivers like SMMU can get local
> > memory. Tests show localizing CMA can decrease dma_unmap latency much.
> > For instance, before this patch, SMMU on node2 has to wait for more than
> > 560ns for the completion of CMD_SYNC in an empty command queue; with this
> > patch, it needs 240ns only.
> >
> > A positive side effect of this patch would be improving performance even
> > further for those users who are worried about performance more than DMA
> > security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all
> > drivers can get local coherent DMA buffers.
> >
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
> > Cc: Steve Capper <steve.capper@arm.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Mike Rapoport <rppt@linux.ibm.com>
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > ---
> > -v7: with respect to Will's comments
> >  * move to use for_each_online_node
> >  * add description if users don't specify pernuma_cma
> >  * provide default value for CONFIG_DMA_PERNUMA_CMA
> >
> >  .../admin-guide/kernel-parameters.txt |  11 ++
> >  include/linux/dma-contiguous.h        |   6 ++
> >  kernel/dma/Kconfig                    |  11 ++
> >  kernel/dma/contiguous.c               | 100 ++++++++++++++++--
> >  4 files changed, 118 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..c609527fc35a 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,17 @@
> >  			altogether. For more information, see
> >  			include/linux/dma-contiguous.h
> >
> > +	pernuma_cma=nn[MG]
>
> Maybe cma_pernuma or cma_pernode?
Sounds good.
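
Just to make the driver-side view concrete (a hedged sketch, not part of this
patch; the function name, device and size below are made up): drivers keep
calling dma_alloc_coherent() exactly as before, and with something like
pernuma_cma=16M (or the renamed option) on the command line, a device sitting
on node 2 should get its buffer from the node-2 area rather than the node-0
default:

	/* hypothetical driver snippet, needs <linux/dma-mapping.h>;
	 * assumes pernuma_cma=16M was passed on the kernel command line
	 */
	static int example_alloc_cmdq(struct device *dev, size_t size)
	{
		dma_addr_t dma;
		void *cpu;

		/* unchanged driver code; backing memory should now be node-local */
		cpu = dma_alloc_coherent(dev, size, &dma, GFP_KERNEL);
		if (!cpu)
			return -ENOMEM;

		/* ... use the command queue, then release it on teardown ... */
		dma_free_coherent(dev, size, cpu, dma);
		return 0;
	}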
>
> > +			[ARM64,KNL]
> > +			Sets the size of kernel per-numa memory area for
> > +			contiguous memory allocations. A value of 0 disables
> > +			per-numa CMA altogether. And If this option is not
> > +			specificed, the default value is 0.
> > +			With per-numa CMA enabled, DMA users on node nid will
> > +			first try to allocate buffer from the pernuma area
> > +			which is located in node nid, if the allocation fails,
> > +			they will fallback to the global default memory area.
> > +
> >  	cmo_free_hint=	[PPC] Format: { yes | no }
> >  			Specify whether pages are marked as being inactive
> >  			when they are freed. This is used in CMO environments
> >
> > diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
> > index 03f8e98e3bcc..fe55e004f1f4 100644
> > --- a/include/linux/dma-contiguous.h
> > +++ b/include/linux/dma-contiguous.h
> > @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, struct page *page,
> >
> >  #endif
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +void dma_pernuma_cma_reserve(void);
> > +#else
> > +static inline void dma_pernuma_cma_reserve(void) { }
> > +#endif
> > +
> >  #endif
> >
> >  #endif
> >
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index 847a9d1fa634..c38979d45b13 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -118,6 +118,17 @@ config DMA_CMA
> >  	  If unsure, say "n".
> >
> >  if DMA_CMA
> > +
> > +config DMA_PERNUMA_CMA
> > +	bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
> > +	default NUMA && ARM64
> > +	help
> > +	  Enable this option to get pernuma CMA areas so that devices like
> > +	  ARM64 SMMU can get local memory by DMA coherent APIs.
> > +
> > +	  You can set the size of pernuma CMA by specifying "pernuma_cma=size"
> > +	  on the kernel's command line.
> > +
> >  comment "Default contiguous memory area size:"
> >
> >  config CMA_SIZE_MBYTES
> >
> > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> > index cff7e60968b9..0383c9b86715 100644
> > --- a/kernel/dma/contiguous.c
> > +++ b/kernel/dma/contiguous.c
> > @@ -69,6 +69,19 @@ static int __init early_cma(char *p)
> >  }
> >  early_param("cma", early_cma);
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +
> > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> > +static phys_addr_t pernuma_size_bytes __initdata;
> > +
> > +static int __init early_pernuma_cma(char *p)
> > +{
> > +	pernuma_size_bytes = memparse(p, &p);
> > +	return 0;
> > +}
> > +early_param("pernuma_cma", early_pernuma_cma);
> > +#endif
> > +
> >  #ifdef CONFIG_CMA_SIZE_PERCENTAGE
> >
> >  static phys_addr_t __init __maybe_unused cma_early_percent_memory(void)
> > @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void)
> >
> >  #endif
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +void __init dma_pernuma_cma_reserve(void)
> > +{
> > +	int nid;
> > +
> > +	if (!pernuma_size_bytes)
> > +		return;
> > +
> > +	for_each_online_node(nid) {
> > +		int ret;
> > +		char name[20];
> > +		struct cma **cma = &dma_contiguous_pernuma_area[nid];
> > +
> > +		snprintf(name, sizeof(name), "pernuma%d", nid);
> > +		ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0,
> > +						 0, false, name, cma, nid);
> > +		if (ret) {
> > +			pr_warn("%s: reservation failed: err %d, node %d", __func__,
> > +				ret, nid);
> > +			continue;
> > +		}
> > +
> > +		pr_debug("%s: reserved %llu MiB on node %d\n", __func__,
> > +			 (unsigned long long)pernuma_size_bytes / SZ_1M, nid);
> > +	}
> > +}
> > +#endif
> > +
> >  /**
> >   * dma_contiguous_reserve() - reserve area(s) for contiguous memory handling
> >   * @limit: End address of the reserved memory (optional, 0 for any).
> > @@ -228,23 +269,44 @@ static struct page *cma_alloc_aligned(struct cma *cma, size_t size, gfp_t gfp)
> >   * @size: Requested allocation size.
> >   * @gfp: Allocation flags.
> >   *
> > - * This function allocates contiguous memory buffer for specified device. It
> > - * tries to use device specific contiguous memory area if available, or the
> > - * default global one.
> > + * tries to use device specific contiguous memory area if available, or it
> > + * tries to use per-numa cma, if the allocation fails, it will fallback to
> > + * try default global one.
> >   *
> > - * Note that it byapss one-page size of allocations from the global area as
> > - * the addresses within one page are always contiguous, so there is no need
> > - * to waste CMA pages for that kind; it also helps reduce fragmentations.
> > + * Note that it bypass one-page size of allocations from the per-numa and
> > + * global area as the addresses within one page are always contiguous, so
> > + * there is no need to waste CMA pages for that kind; it also helps reduce
> > + * fragmentations.
> >   */
> >  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> >  {
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +	int nid = dev_to_node(dev);
> > +#endif
> > +
> >  	/* CMA can be used only in the context which permits sleeping */
> >  	if (!gfpflags_allow_blocking(gfp))
> >  		return NULL;
> >  	if (dev->cma_area)
> >  		return cma_alloc_aligned(dev->cma_area, size, gfp);
> > -	if (size <= PAGE_SIZE || !dma_contiguous_default_area)
> > +	if (size <= PAGE_SIZE)
> > +		return NULL;
> > +
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +	if (nid != NUMA_NO_NODE && !(gfp & (GFP_DMA | GFP_DMA32))) {
> > +		struct cma *cma = dma_contiguous_pernuma_area[nid];
>
> It could be that for some node reservation failedm than
> dma_contiguous_pernuma_area[nid] would be NULL.
> I'd add a fallback to another node here.
This has been done. If dma_contiguous_pernuma_area[nid] is NULL, it will fall back to the default global CMA.
>
> > +		struct page *page;
> > +
> > +		if (cma) {
> > +			page = cma_alloc_aligned(cma, size, gfp);
> > +			if (page)
> > +				return page;
> > +		}
> > +	}
> > +#endif
>
> I think the selection of the area can be put in a helper funciton and
> then here we just try to allocate from the selected area. E.g.
>
> static struct cma* dma_get_cma_area(struct device *dev)
> {
> #ifdef CONFIG_DMA_PERNUMA_CMA
> 	int nid = dev_to_node(dev);
> 	struct cma *cma = dma_contiguous_pernuma_area[nid];
>
> 	if (!cma)
> 		/* select cma from another node */ ;
>
> 	return cma;
> #else
> 	return dma_contiguous_default_area;
> #endif
> }
>
It is possible that dma_contiguous_pernuma_area[nid] is not NULL but we still fail to get memory from it, either because it is full or because it cannot satisfy GFP_DMA(32). In that case we still need to fall back to the default global CMA. So the code tries the per-numa CMA first and then the default global CMA; it is not picking one of the two areas, it is trying both.
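
To make the ordering concrete, the allocation path in this patch boils down
to the following (a shortened restatement of the hunk above, for discussion
only, not new code):

	/*
	 * 1. device-specific area, 2. per-numa area of dev's node,
	 * 3. global default area. A miss at step 2 (area NULL, full,
	 * or skipped because of GFP_DMA/GFP_DMA32) falls through to
	 * step 3 instead of returning NULL.
	 */
	if (dev->cma_area)
		return cma_alloc_aligned(dev->cma_area, size, gfp);

	if (size <= PAGE_SIZE)
		return NULL;

#ifdef CONFIG_DMA_PERNUMA_CMA
	if (nid != NUMA_NO_NODE && !(gfp & (GFP_DMA | GFP_DMA32))) {
		struct cma *cma = dma_contiguous_pernuma_area[nid];

		if (cma) {
			page = cma_alloc_aligned(cma, size, gfp);
			if (page)
				return page;
		}
	}
#endif

	if (!dma_contiguous_default_area)
		return NULL;

	return cma_alloc_aligned(dma_contiguous_default_area, size, gfp);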
> struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> {
> 	struct cma *cma;
> 	...
>
> 	cma = dma_get_cma_area(dev);
> 	if (!cma)
> 		return NULL;
>
> 	return cma_alloc_aligned(cma, size, gfp);
> }
>
> > +	if (!dma_contiguous_default_area)
> >  		return NULL;
> > +
> >  	return cma_alloc_aligned(dma_contiguous_default_area, size, gfp);
> >  }
> >
> > @@ -261,9 +323,27 @@ struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> >   */
> >  void dma_free_contiguous(struct device *dev, struct page *page, size_t size)
> >  {
> > -	if (!cma_release(dev_get_cma_area(dev), page,
> > -			PAGE_ALIGN(size) >> PAGE_SHIFT))
> > -		__free_pages(page, get_order(size));
>
> Here as well, dev_get_cma_area() can be replaced with, say
> dma_get_dev_cma_area(dev, page) that will hide the below logic.
As explained above, this won't work. The page may have come from either the per-numa area or the default global one after a fallback, so the free path has to try both rather than selecting a single area up front.
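
The reason trying each candidate area in turn is safe (including a per-numa
slot that is NULL because its reservation failed) is that cma_release() just
reports false when the page does not belong to the area it is handed. Roughly,
the guard at the top of mm/cma.c looks like this (paraphrased from memory, not
copied from this series):

	bool cma_release(struct cma *cma, const struct page *pages,
			 unsigned int count)
	{
		unsigned long pfn;

		/* not this area's page: NULL area or pfn outside its range */
		if (!cma || !pages)
			return false;

		pfn = page_to_pfn(pages);
		if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
			return false;

		/* ... otherwise clear the bitmap and free the range ... */
		return true;
	}

So the hunk below can offer the page to the per-numa area of page_to_nid(page)
first and to the default area second, and only falls back to __free_pages()
when neither claims it.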
>
> > +	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
> > +
> > +	/* if dev has its own cma, free page from there */
> > +	if (dev->cma_area) {
> > +		if (cma_release(dev->cma_area, page, count))
> > +			return;
> > +	} else {
> > +		/*
> > +		 * otherwise, page is from either per-numa cma or default cma
> > +		 */
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +		if (cma_release(dma_contiguous_pernuma_area[page_to_nid(page)],
> > +				page, count))
> > +			return;
> > +#endif
> > +		if (cma_release(dma_contiguous_default_area, page, count))
> > +			return;
> > +	}
> > +
> > +	/* not in any cma, free from buddy */
> > +	__free_pages(page, get_order(size));
> >  }
> >
> >  /*
> > --
> > 2.27.0
> >
> >
Thanks Barry