Normalization Functions
Source: vignettes/articles/Normalization-Functions.Rmd
Data
The data used in this article comes from a single source:
- A SummarizedExperiment object from the airway library
We use a single object because it can be manipulated either as a
SummarizedExperiment or as a data.frame.
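### summarized experiment
library(SummarizedExperiment)  # assay(), rowData()
library(airway)                # example dataset
library(dominatR)              # normalization functions (attached under Session Info)
data("airway")
se <- airway
Normalization Methods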
Normalization is critical for correcting technical biases and enabling meaningful biological comparisons.
The package contains several normalization methods. Some of them accept a
log_trans argument that returns log2(x + 1) of the normalized value when
set to TRUE. The available functions are:
- cpm_normalization
- minmax_normalization
- quantile_normalization
- rpkm_normalization
- tpm_normalization
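As a small illustration of the log2(x + 1) transform mentioned above, the following base R snippet applies it to a made-up vector. The vector `x` is hypothetical and not taken from the airway data; the point is only that adding 1 before taking the logarithm keeps zero values at zero instead of producing -Inf.

```r
# Hypothetical normalized values, including a zero
x <- c(0, 1, 3, 1023)

# log2(x + 1): zeros stay at zero and larger values are compressed
log_x <- log2(x + 1)
log_x
```

Without the offset, `log2(0)` would return `-Inf`, which is why a pseudocount is commonly added before log-transforming expression values.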
Let’s explore the usage of each normalization method on the
airway data.
Min-Max Normalization
Min-Max normalization is a linear transformation that rescales the values in each column (sample) to a specified range (typically [0, 1]). It is useful when you want to bring samples onto the same scale.
Function Purpose:
· Rescales each column to fit within a range [new_min, new_max].
· Preserves the relative structure of values within each column.
· Useful when different assays or samples have varying scales.
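The column-wise rescaling listed above can be sketched in base R. This is an illustrative re-implementation, assuming every column is scaled independently; `minmax_sketch` and the toy matrix are hypothetical names, not part of the package, and the package's own function may differ in its details (for example, a constant column would produce NaN in this sketch).

```r
# Hypothetical helper: rescale each column of a numeric matrix to [new_min, new_max]
minmax_sketch <- function(x, new_min = 0, new_max = 1) {
  x <- as.matrix(x)
  col_min <- apply(x, 2, min)
  col_max <- apply(x, 2, max)
  # Subtract each column's minimum, then divide by each column's range
  scaled <- sweep(x, 2, col_min, "-")
  scaled <- sweep(scaled, 2, col_max - col_min, "/")
  # Stretch and shift from [0, 1] to the requested range
  scaled * (new_max - new_min) + new_min
}

# Toy matrix: 3 genes (rows) x 2 samples (columns)
toy <- matrix(c(0, 5, 10, 2, 4, 6), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

res_unit   <- minmax_sketch(toy)                              # columns span 0 to 1
res_custom <- minmax_sketch(toy, new_min = 10, new_max = 20)  # columns span 10 to 20
```

Because the transformation is linear within a column, the ordering and relative spacing of values inside each sample are preserved.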
Example 1: Normalize a matrix
# Prepare input matrix
count_mat <- assay(se)
# Apply min-max normalization
se_minmax <- minmax_normalization(count_mat, new_min = 0, new_max = 1)
# Inspect structure
dim(se_minmax)
#> [1] 63677 8
summary(as.vector(se_minmax))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.000e+00 0.000e+00 0.000e+00 9.679e-04 2.739e-05 1.000e+00
head(se_minmax[, 1:5])
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513
#> ENSG00000000003 0.0022792424 0.0017523136 1.699217e-03 0.0014897144
#> ENSG00000000005 0.0000000000 0.0000000000 0.000000e+00 0.0000000000
#> ENSG00000000419 0.0015676086 0.0020143784 1.208721e-03 0.0013327102
#> ENSG00000000457 0.0008727585 0.0008253084 5.119062e-04 0.0005988068
#> ENSG00000000460 0.0002014058 0.0002151278 7.785646e-05 0.0001277941
#> ENSG00000000938 0.0000000000 0.0000000000 3.892823e-06 0.0000000000
#> SRR1039516
#> ENSG00000000003 2.860799e-03
#> ENSG00000000005 0.000000e+00
#> ENSG00000000419 1.475649e-03
#> ENSG00000000457 6.159013e-04
#> ENSG00000000460 1.960829e-04
#> ENSG00000000938 2.513883e-06
You can set new_min = 10 and new_max = 20 if your downstream application prefers values in a different scale:
df_scaled <- minmax_normalization(count_mat, new_min = 10, new_max = 20)
head(df_scaled) # All columns now range from 10 to 20
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
#> ENSG00000000003 10.02279 10.01752 10.01699 10.01490 10.02861
#> ENSG00000000005 10.00000 10.00000 10.00000 10.00000 10.00000
#> ENSG00000000419 10.01568 10.02014 10.01209 10.01333 10.01476
#> ENSG00000000457 10.00873 10.00825 10.00512 10.00599 10.00616
#> ENSG00000000460 10.00201 10.00215 10.00078 10.00128 10.00196
#> ENSG00000000938 10.00000 10.00000 10.00004 10.00000 10.00003
#> SRR1039517 SRR1039520 SRR1039521
#> ENSG00000000003 10.02607 10.02033 10.01536
#> ENSG00000000005 10.00000 10.00000 10.00000
#> ENSG00000000419 10.01990 10.01101 10.01364
#> ENSG00000000457 10.00824 10.00615 10.00615
#> ENSG00000000460 10.00157 10.00201 10.00161
#> ENSG00000000938 10.00000 10.00000 10.00000
Example 2: Normalize a SummarizedExperiment
se <- se
# Option A: Overwrite the default assay
se1 <- minmax_normalization(se)
head(assay(se1))
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513
#> ENSG00000000003 0.0022792424 0.0017523136 1.699217e-03 0.0014897144
#> ENSG00000000005 0.0000000000 0.0000000000 0.000000e+00 0.0000000000
#> ENSG00000000419 0.0015676086 0.0020143784 1.208721e-03 0.0013327102
#> ENSG00000000457 0.0008727585 0.0008253084 5.119062e-04 0.0005988068
#> ENSG00000000460 0.0002014058 0.0002151278 7.785646e-05 0.0001277941
#> ENSG00000000938 0.0000000000 0.0000000000 3.892823e-06 0.0000000000
#> SRR1039516 SRR1039517 SRR1039520 SRR1039521
#> ENSG00000000003 2.860799e-03 0.0026074678 0.0020325525 0.0015356158
#> ENSG00000000005 0.000000e+00 0.0000000000 0.0000000000 0.0000000000
#> ENSG00000000419 1.475649e-03 0.0019898441 0.0011007460 0.0013637987
#> ENSG00000000457 6.159013e-04 0.0008243284 0.0006150451 0.0006147833
#> ENSG00000000460 1.960829e-04 0.0001568963 0.0002006156 0.0001610786
#> ENSG00000000938 2.513883e-06 0.0000000000 0.0000000000 0.0000000000
# Option B: Write to a new assay slot
se2 <- minmax_normalization(se, new_assay_name = "minmax_counts")
The new_assay_name option stores the normalized data in a new assay of
the SummarizedExperiment object, leaving the count matrix intact. If no
name is provided, the function overwrites the count matrix.
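To make the difference between overwriting and adding an assay concrete, here is a minimal sketch that uses only the SummarizedExperiment accessors and a toy object. It does not call the package's normalization functions; `toy_se`, `counts`, and `scaled` are hypothetical names, and the manual rescaling stands in for whatever the package computes internally.

```r
library(SummarizedExperiment)

# Toy count matrix: 3 genes (rows) x 2 samples (columns)
counts <- matrix(c(0, 5, 10, 2, 4, 6), nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
toy_se <- SummarizedExperiment(assays = list(counts = counts))

# Column-wise min-max rescaling, done by hand for illustration
scaled <- apply(counts, 2, function(v) (v - min(v)) / (max(v) - min(v)))

# Store the result under a new assay name; the original counts are untouched
assay(toy_se, "minmax_counts") <- scaled
assayNames(toy_se)
```

Keeping the raw counts alongside the normalized values makes it possible to apply a different normalization later without reloading the data.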
Quantile Normalization
Quantile normalization makes the distribution of values identical across all samples. Within each sample the values are ranked, and each value is then replaced by a summary (typically the mean) of the values that hold the same rank across samples.
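The idea can be sketched in a few lines of base R. This is a textbook version of quantile normalization that uses the mean across samples at each rank and breaks ties by order of appearance; both choices are assumptions, and `quantile_sketch` is a hypothetical helper whose tie handling may differ from the package's quantile_normalization().

```r
# Hypothetical helper: textbook quantile normalization of a numeric matrix
quantile_sketch <- function(x) {
  x <- as.matrix(x)
  # Rank of every value within its own column (ties broken by order of appearance)
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Sort each column, then average across columns at each rank position
  sorted <- apply(x, 2, sort)
  rank_means <- rowMeans(sorted)
  # Replace each value with the mean that belongs to its rank
  out <- apply(ranks, 2, function(r) rank_means[r])
  dimnames(out) <- dimnames(x)
  out
}

# Toy matrix: 3 genes (rows) x 3 samples (columns), no ties
toy <- matrix(c(5, 2, 3,
                4, 1, 6,
                3, 8, 7), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3")))

res <- quantile_sketch(toy)
apply(res, 2, sort)  # after normalization every column holds the same set of values
```

Within each sample the ordering of genes is unchanged; only the values attached to each rank are replaced.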
Example 1: Normalize a matrix
count_mat <- assay(se)
se_quantile <- quantile_normalization(count_mat)
## Check result
dim(se_quantile)
#> [1] 63677 8
summary(as.vector(se_quantile))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00 0.00 0.00 344.36 9.62 361483.12
head(se_quantile[1:5, 1:5])
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
#> ENSG00000000003 690.875 504.750 773.875 613.75 1010.000
#> ENSG00000000005 0.000 0.000 0.000 0.00 0.000
#> ENSG00000000419 468.875 582.375 550.625 552.00 516.875
#> ENSG00000000457 257.375 241.375 225.250 254.00 213.125
#> ENSG00000000460 58.000 65.250 31.500 53.75 67.750
Example 2: Normalize a SummarizedExperiment
## Apply quantile normalization to a SummarizedExperiment
se_quantile <- quantile_normalization(se)
## Check result
dim(assay(se_quantile))
#> [1] 63677 8
summary(as.vector(assay(se_quantile)))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00 0.00 0.00 344.36 9.62 361483.12
head(assay(se_quantile)[1:5, 1:5])
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
#> ENSG00000000003 690.875 504.750 773.875 613.75 1010.000
#> ENSG00000000005 0.000 0.000 0.000 0.00 0.000
#> ENSG00000000419 468.875 582.375 550.625 552.00 516.875
#> ENSG00000000457 257.375 241.375 225.250 254.00 213.125
#> ENSG00000000460 58.000 65.250 31.500 53.75 67.750
CPM Normalization
The cpm_normalization() function rescales raw count data such that each column sums to one million. This makes count data comparable across samples of different sequencing depths.
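The computation described above amounts to dividing each count by its sample's total and multiplying by one million. The sketch below shows this in base R on a toy matrix; `cpm_sketch` is a hypothetical helper written for illustration and is not the package's implementation, although its `log_trans` argument mirrors the log2(x + 1) option described earlier.

```r
# Hypothetical helper: counts per million, optionally log2(x + 1) transformed
cpm_sketch <- function(counts, log_trans = FALSE) {
  counts <- as.matrix(counts)
  lib_size <- colSums(counts)                      # total reads per sample
  cpm <- sweep(counts, 2, lib_size, "/") * 1e6     # scale each column to one million
  if (log_trans) log2(cpm + 1) else cpm
}

# Toy matrix: the second sample is sequenced ten times deeper than the first
toy <- matrix(c(10, 30, 60, 200, 300, 500), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

res <- cpm_sketch(toy)
colSums(res)                              # each column sums to one million
res_log <- cpm_sketch(toy, log_trans = TRUE)
```

After scaling, the tenfold difference in sequencing depth between the two toy samples no longer dominates the comparison.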
Example 1: Normalize a data.frame
df <- assay(se)  # assay() returns the count matrix, used here as the tabular input
# Normalize without log2-transform
df_cpm <- cpm_normalization(df, log_trans = FALSE)
head(df_cpm[, 1:5])
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
#> ENSG00000000003 32.900521 23.817776 34.43970525 26.906868 46.54699807
#> ENSG00000000005 0.000000 0.000000 0.00000000 0.000000 0.00000000
#> ENSG00000000419 22.628193 27.379809 24.49834703 24.071095 24.00974329
#> ENSG00000000457 12.598138 11.217747 10.37530639 10.815506 10.02110240
#> ENSG00000000460 2.907263 2.924057 1.57799337 2.308187 3.19039178
#> ENSG00000000938 0.000000 0.000000 0.07889967 0.000000 0.04090246
# Normalize with log2-transform
df_cpm_log <- cpm_normalization(df, log_trans = TRUE)
head(df_cpm_log[, 1:5])
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
#> ENSG00000000003 5.083236 4.633302 5.1472947 4.802548 5.57128235
#> ENSG00000000005 0.000000 0.000000 0.0000000 0.000000 0.00000000
#> ENSG00000000419 4.562437 4.826793 4.6723318 4.647953 4.64441834
#> ENSG00000000457 3.765337 3.610906 3.5078335 3.562609 3.46219663
#> ENSG00000000460 1.966158 1.972346 1.3662486 1.726041 2.06708514
#> ENSG00000000938 0.000000 0.000000 0.1095607 0.000000 0.05783488
Example 2: Normalize a SummarizedExperiment
# Apply in-place normalization (overwrite assay)
se1 <- cpm_normalization(se, log_trans = FALSE)
head(assay(se1))
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
#> ENSG00000000003 32.900521 23.817776 34.43970525 26.906868 46.54699807
#> ENSG00000000005 0.000000 0.000000 0.00000000 0.000000 0.00000000
#> ENSG00000000419 22.628193 27.379809 24.49834703 24.071095 24.00974329
#> ENSG00000000457 12.598138 11.217747 10.37530639 10.815506 10.02110240
#> ENSG00000000460 2.907263 2.924057 1.57799337 2.308187 3.19039178
#> ENSG00000000938 0.000000 0.000000 0.07889967 0.000000 0.04090246
#> SRR1039517 SRR1039520 SRR1039521
#> ENSG00000000003 33.973415 40.259015 27.026857
#> ENSG00000000005 0.000000 0.000000 0.000000
#> ENSG00000000419 25.926226 21.802609 24.002873
#> ENSG00000000457 10.740401 12.182273 10.820193
#> ENSG00000000460 2.044246 3.973617 2.834985
#> ENSG00000000938 0.000000 0.000000 0.000000
# Save to a new assay slot
se2 <- cpm_normalization(se, log_trans = TRUE, new_assay_name = "cpm_logged")
head(assay(se2, "cpm_logged"))
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
#> ENSG00000000003 5.083236 4.633302 5.1472947 4.802548 5.57128235
#> ENSG00000000005 0.000000 0.000000 0.0000000 0.000000 0.00000000
#> ENSG00000000419 4.562437 4.826793 4.6723318 4.647953 4.64441834
#> ENSG00000000457 3.765337 3.610906 3.5078335 3.562609 3.46219663
#> ENSG00000000460 1.966158 1.972346 1.3662486 1.726041 2.06708514
#> ENSG00000000938 0.000000 0.000000 0.1095607 0.000000 0.05783488
#> SRR1039517 SRR1039520 SRR1039521
#> ENSG00000000003 5.128187 5.366637 4.808738
#> ENSG00000000005 0.000000 0.000000 0.000000
#> ENSG00000000419 4.750940 4.511127 4.644022
#> ENSG00000000457 3.553410 3.720527 3.563182
#> ENSG00000000460 1.606085 2.314295 1.939221
#> ENSG00000000938 0.000000 0.000000 0.000000
RPKM Normalization
Reads per kilobase per million (RPKM) normalization adjusts for both gene length and sequencing depth, making it particularly useful for RNA-Seq data. RPKM helps compare gene expression levels across genes of different lengths.
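As a rough illustration of the two adjustments, the sketch below first scales counts by sequencing depth (reads per million) and then divides by gene length in kilobases. `rpkm_sketch` and the toy inputs are hypothetical; the code follows the standard RPKM definition and is not taken from the package.

```r
# Hypothetical helper: RPKM from a count matrix and gene lengths in base pairs
rpkm_sketch <- function(counts, gene_length, log_trans = FALSE) {
  counts <- as.matrix(counts)
  stopifnot(length(gene_length) == nrow(counts))
  per_million <- colSums(counts) / 1e6     # sequencing depth in millions of reads
  length_kb <- gene_length / 1e3           # gene length in kilobases
  rpm <- sweep(counts, 2, per_million, "/")
  # Vector recycling runs down the columns, so row i is divided by length_kb[i]
  rpkm <- rpm / length_kb
  if (log_trans) log2(rpkm + 1) else rpkm
}

# Toy matrix: 3 genes (rows) x 2 samples (columns)
toy <- matrix(c(10, 30, 60, 200, 300, 500), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
toy_len <- c(1000, 2000, 500)              # made-up gene lengths in base pairs

res <- rpkm_sketch(toy, gene_length = toy_len)
res
```

In the toy data, g3 is the shortest gene, so its RPKM rises relative to its CPM, while the longer g2 is scaled down.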
Example 1: Normalize a data.frame
df <- assay(se)
# Approximate gene length as the span between annotated start and end coordinates
gene_len <- rowData(se)$gene_seq_end - rowData(se)$gene_seq_start
df_rpkm <- rpkm_normalization(df, gene_length = gene_len)
head(df_rpkm[, 1:5])
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
#> ENSG00000000003 2.90614973 2.10385794 3.042108051 2.37672181 4.11156241
#> ENSG00000000005 0.00000000 0.00000000 0.000000000 0.00000000 0.00000000
#> ENSG00000000419 0.95525977 1.15585145 1.034209179 1.01617253 1.01358254
#> ENSG00000000457 0.28224164 0.25131614 0.232442566 0.24230454 0.22450718
#> ENSG00000000460 0.01514389 0.01523137 0.008219743 0.01202331 0.01661870
#> ENSG00000000938 0.00000000 0.00000000 0.003398943 0.00000000 0.00176205
Example 2: Normalize a SummarizedExperiment
## Gene length needed
rowData(se)$gene_length <- rowData(se)$gene_seq_end - rowData(se)$gene_seq_start
## Apply RPKM normalization
se_rpkm <- rpkm_normalization(se, gene_length, log_trans = TRUE)
## Check the result
dim(assay(se_rpkm))
#> [1] 63677 8
head(assay(se_rpkm)[1:5, 1:5])
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
#> ENSG00000000003 1.96574725 1.63406253 2.01510789 1.75562333 2.35376434
#> ENSG00000000005 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
#> ENSG00000000419 0.96736029 1.10825777 1.02446804 1.01161910 1.00976461
#> ENSG00000000457 0.35866816 0.32344632 0.30152042 0.31301889 0.29220123
#> ENSG00000000460 0.02168423 0.02180855 0.01181011 0.01724252 0.02377868
TPM Normalization
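Transcripts per million (TPM) normalization first divides each count by the gene length and then rescales each sample so that its values sum to one million. As with RPKM, it adjusts for both gene length and sequencing depth.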
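A minimal base R sketch of the standard TPM calculation is shown below: counts are converted to a per-kilobase rate, and each sample's rates are then scaled to sum to one million. `tpm_sketch` and the toy inputs are hypothetical names for illustration, and the package's tpm_normalization() may differ in its details.

```r
# Hypothetical helper: TPM from a count matrix and gene lengths in base pairs
tpm_sketch <- function(counts, gene_length, log_trans = FALSE) {
  counts <- as.matrix(counts)
  stopifnot(length(gene_length) == nrow(counts))
  # Reads per kilobase: row i is divided by the length of gene i
  rate <- counts / (gene_length / 1e3)
  # Rescale each sample so that its rates sum to one million
  tpm <- sweep(rate, 2, colSums(rate), "/") * 1e6
  if (log_trans) log2(tpm + 1) else tpm
}

# Toy matrix: 3 genes (rows) x 2 samples (columns)
toy <- matrix(c(10, 30, 60, 200, 300, 500), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
toy_len <- c(1000, 2000, 500)              # made-up gene lengths in base pairs

res <- tpm_sketch(toy, gene_length = toy_len)
colSums(res)                               # each column sums to one million
```

Because the length adjustment happens before the per-sample scaling, every sample has the same total, which is the main practical difference from RPKM.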
Example 1: Normalize a data.frame
df <- assay(se)
# Simulated gene lengths, for illustration only; no seed is set, so values differ between runs
gene_len <- sample(400:800, nrow(df), replace = TRUE)
df_tpm <- tpm_normalization(df, gene_length = gene_len)
head(df_tpm[, 1:5])
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
#> ENSG00000000003 33.390121 24.122006 34.96289802 27.202754 47.35784714
#> ENSG00000000005 0.000000 0.000000 0.00000000 0.000000 0.00000000
#> ENSG00000000419 17.352628 20.952835 18.79251602 18.388476 18.45814018
#> ENSG00000000457 9.547481 8.483696 7.86531621 8.165143 7.61347687
#> ENSG00000000460 3.784083 3.798038 2.05453877 2.992829 4.16299091
#> ENSG00000000938 0.000000 0.000000 0.07636036 0.000000 0.03967295
Example 2: Normalize a SummarizedExperiment
## Gene length needed
rowData(se)$gene_length <- rowData(se)$gene_seq_end - rowData(se)$gene_seq_start
## Apply TPM normalization
se_tpm <- tpm_normalization(se, gene_length, log_trans = TRUE)
## Check the result
dim(assay(se_tpm))
#> [1] 63677 8
head(assay(se_tpm)[1:5, 1:5])
#> SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
#> ENSG00000000003 4.4294986 3.9577411 4.55434051 4.2249660 4.8647104
#> ENSG00000000005 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
#> ENSG00000000419 2.9549906 3.1678707 3.11233806 3.0989031 2.9884095
#> ENSG00000000457 1.5828547 1.4524115 1.44301228 1.4877425 1.3427320
#> ENSG00000000460 0.1467549 0.1443756 0.08513068 0.1237197 0.1553897
Session Info
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ggplot2_4.0.0 dominatRData_0.99.1
#> [3] airway_1.28.0 SummarizedExperiment_1.38.1
#> [5] Biobase_2.68.0 GenomicRanges_1.60.0
#> [7] GenomeInfoDb_1.44.3 IRanges_2.42.0
#> [9] S4Vectors_0.46.0 BiocGenerics_0.54.1
#> [11] generics_0.1.4 MatrixGenerics_1.20.0
#> [13] matrixStats_1.5.0 dominatR_0.99.4
#> [15] knitr_1.50
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.53 bslib_0.9.0
#> [4] lattice_0.22-7 vctrs_0.6.5 tools_4.5.1
#> [7] tibble_3.3.0 pkgconfig_2.0.3 Matrix_1.7-3
#> [10] ggnewscale_0.5.2 RColorBrewer_1.1-3 S7_0.2.0
#> [13] desc_1.4.3 lifecycle_1.0.4 GenomeInfoDbData_1.2.14
#> [16] compiler_4.5.1 farver_2.1.2 textshaping_1.0.4
#> [19] geomtextpath_0.2.0 ggforce_0.5.0 htmltools_0.5.8.1
#> [22] sass_0.4.10 yaml_2.3.10 pillar_1.11.1
#> [25] pkgdown_2.1.3 crayon_1.5.3 jquerylib_0.1.4
#> [28] MASS_7.3-65 DelayedArray_0.34.1 cachem_1.1.0
#> [31] abind_1.4-8 tidyselect_1.2.1 digest_0.6.37
#> [34] dplyr_1.1.4 polyclip_1.10-7 fastmap_1.2.0
#> [37] grid_4.5.1 cli_3.6.5 SparseArray_1.8.1
#> [40] magrittr_2.0.4 S4Arrays_1.8.1 withr_3.0.2
#> [43] UCSC.utils_1.4.0 scales_1.4.0 rmarkdown_2.30
#> [46] XVector_0.48.0 httr_1.4.7 ragg_1.5.0
#> [49] evaluate_1.0.5 rlang_1.1.6 glue_1.8.0
#> [52] tweenr_2.0.3 jsonlite_2.0.0 R6_2.6.1
#> [55] systemfonts_1.3.1 fs_1.6.6