Author: Emmanuel Dollinger

Introduction

Goal of this notebook is to align mouse single cell transcriptomic reads using cellranger and kallisto to compare the outputs. All cellranger code was ran on the HPC3, all kallisto code was run on a desktop with 10 cores (which right there is a clue as to which aligner I prefer). I copied the code ran on the HPC3 below.

Part 1: cellranger

First I built a reference, see https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr

# wget files from Ensembl:
# !wget ftp://ftp.ensembl.org/pub/release-103/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
# !wget ftp://ftp.ensembl.org/pub/release-103/gtf/mus_musculus/Mus_musculus.GRCm39.103.gtf.gz

Uploaded both files to HPC3 using scp.

First filtered the gtf, script submitted: (see https://rcic.uci.edu/hpc3/examples.html for slurm script examples)

#!/bin/bash

#SBATCH --job-name=cellrangermkref      ## Name of the job.
#SBATCH -A qnie_lab     ## account to charge
#SBATCH -p standard          ## partition/queue name
#SBATCH --nodes=1            ## (-N) number of nodes to use
#SBATCH --ntasks=1           ## (-n) number of tasks to launch
#SBATCH --cpus-per-task=10    ## number of cores the job needs
#SBATCH --error=slurm-%J.err ## error log file

# Attempt to make custom cellranger reference. See https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr

module load cellranger/3.1.0

cellranger mkgtf \
/pub/edolling/GenPALSReadAlignerComparison/Data/Mus_musculus.GRCm39.103.gtf \
/pub/edolling/GenPALSReadAlignerComparison/Data/Mus_musculus.GRCm39.103.filtered.gtf

Took >5 minutes.

Now create custom index: Notice the queue I submitted to is the highmem queue, you need to ask the HPC3 folks to add you to this queue.

#!/bin/bash

#SBATCH --job-name=cellrangermkref      ## Name of the job.
#SBATCH -A qnie_lab     ## account to charge
#SBATCH -p highmem          ## partition/queue name
#SBATCH --nodes=1            ## (-N) number of nodes to use
#SBATCH --ntasks=1           ## (-n) number of tasks to launch
#SBATCH --cpus-per-task=20    ## number of cores the job needs
#SBATCH --error=slurm-%J.err ## error log file

# Attempt to make custom cellranger reference. See https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr

module load cellranger/3.1.0

cellranger mkref \
--genome=cellranger_ref_GRCm39.103 \
--genes=/pub/edolling/GenPALSReadAlignerComparison/Data/Mus_musculus.GRCm39.103.filtered.gtf \
--fasta=/pub/edolling/GenPALSReadAlignerComparison/Data/Mus_musculus.GRCm39.dna.primary_assembly.fa

~45 minutes.

Now I actually ran the aligner on that index. This took on the order of 5h with 2 nodes * 20 cores. Below is the code to align mouse F1, F2 was exactly the same.

#!/bin/bash

#SBATCH --job-name=count_F1      ## Name of the job.
#SBATCH -A qnie_lab     ## account to charge
#SBATCH -p highmem          ## partition/queue name
#SBATCH --nodes=2            ## (-N) number of nodes to use
#SBATCH --ntasks=2           ## (-n) number of tasks to launch
#SBATCH --cpus-per-task=20    ## number of cores the job needs
#SBATCH --error=slurm-%J.err ## error log file

# Run cellranger count on reads. Index was filtered in cellrangerfiltergtf.sub and built in cellrangermkref.sub.
module load cellranger/3.1.0

cellranger count --id=cellrangercomparison_F1 --fastqs=/pub/edolling/GenPALSReadAlignerComparison/Data/GliPtch6wtam/F1Reads \
--transcriptome=/pub/edolling/GenPALSReadAlignerComparison/Data/cellranger_ref_GRCm39.103 \

scp the files down to the desktop.

#eg
# scp edolling@hpc3.rcic.uci.edu:/pub/edolling/GenPALSReadAlignerComparison/Code/cellrangercomparison_F2/outs/raw_feature_bc_matrix.h5 ./

Part 2: kallisto code.

As specified above, all code in this section was run on a desktop with 10 cores. Also notice that the reads and the files I used to build the index are the same as in part 1, but for kallisto I concatenated the reads just to make my life a little easier.

import os
!pwd
/Users/emmanueldollinger/Documents/Projects/GenPALS/Code
!kb ref -i ../Data/transcriptome.idx -g ../Data/transcripts_to_genes.txt -f1 ../Data/cdna.fa \
~/Documents/Projects/BulkRNAseq/Data/IndexFilesGRCm39.103/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz \
~/Documents/Projects/BulkRNAseq/Data/IndexFilesGRCm39.103/Mus_musculus.GRCm39.103.filtered.gtf
[2021-03-29 11:05:53,300]    INFO Preparing /Users/emmanueldollinger/Documents/Projects/BulkRNAseq/Data/IndexFilesGRCm39.103/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz, /Users/emmanueldollinger/Documents/Projects/BulkRNAseq/Data/IndexFilesGRCm39.103/Mus_musculus.GRCm39.103.filtered.gtf
[2021-03-29 11:05:53,300]    INFO Creating transcript-to-gene mapping at /Users/emmanueldollinger/Documents/Projects/GenPALS/Code/tmp/tmp2mjuur6u
[2021-03-29 11:06:24,915]    INFO Decompressing /Users/emmanueldollinger/Documents/Projects/BulkRNAseq/Data/IndexFilesGRCm39.103/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz to tmp
[2021-03-29 11:06:38,892]    INFO Sorting tmp/Mus_musculus.GRCm39.dna.primary_assembly.fa to /Users/emmanueldollinger/Documents/Projects/GenPALS/Code/tmp/tmp179k8x_7
[2021-03-29 11:12:21,963]    INFO Sorting /Users/emmanueldollinger/Documents/Projects/BulkRNAseq/Data/IndexFilesGRCm39.103/Mus_musculus.GRCm39.103.filtered.gtf to /Users/emmanueldollinger/Documents/Projects/GenPALS/Code/tmp/tmp3ez7oqdu
[2021-03-29 11:13:19,371]    INFO Splitting genome tmp/Mus_musculus.GRCm39.dna.primary_assembly.fa into cDNA at /Users/emmanueldollinger/Documents/Projects/GenPALS/Code/tmp/tmpnvnlj06p
[2021-03-29 11:13:19,371] WARNING The following chromosomes were found in the FASTA but does not have any "transcript" features in the GTF: GL456382.1, GL456372.1, MU069434.1, GL456239.1, JH584301.1, GL456394.1, JH584302.1, GL456381.1, MU069435.1, GL456379.1, GL456367.1, GL456383.1, GL456387.1, GL456368.1, GL456389.1, JH584300.1, GL456359.1, GL456396.1, GL456392.1, GL456370.1, GL456390.1, GL456378.1, GL456385.1, GL456366.1, GL456360.1. No sequences will be generated for these chromosomes.
[2021-03-29 11:14:06,971]    INFO Wrote 140725 cDNA transcripts
[2021-03-29 11:14:06,976]    INFO Concatenating 1 transcript-to-gene mappings to ../Data/transcripts_to_genes.txt
[2021-03-29 11:14:07,050]    INFO Concatenating 1 cDNAs to ../Data/cdna.fa
[2021-03-29 11:14:07,729]    INFO Indexing ../Data/cdna.fa to ../Data/transcriptome.idx

Now concat the reads together (I’m assuming no batch effects between runs here).

!mkdir ../Data/CatFiles
!cat ../Data/Gli\ Ptch\ 6w\ tam\ scSeq/KW_6W_F1*R1*.fastq.gz > ../Data/CatFiles/F1_R1.fastq.gz
!cat ../Data/Gli\ Ptch\ 6w\ tam\ scSeq/KW_6W_F1*R2*.fastq.gz > ../Data/CatFiles/F1_R2.fastq.gz

!cat ../Data/Gli\ Ptch\ 6w\ tam\ scSeq/KW_6W_F2*R1*.fastq.gz > ../Data/CatFiles/F2_R1.fastq.gz
!cat ../Data/Gli\ Ptch\ 6w\ tam\ scSeq/KW_6W_F2*R2*.fastq.gz > ../Data/CatFiles/F2_R2.fastq.gz
!ls ../Data/CatFiles/
F1_R1.fastq.gz F1_R2.fastq.gz F2_R1.fastq.gz F2_R2.fastq.gz

Run kallisto on F1 and F2.

!kb count --h5ad -i ../Data/transcriptome.idx \
-g ../Data/transcripts_to_genes.txt -x 10XV3 \
-o ../Data/kallisto_align_F1 --filter bustools --overwrite -t 10 ../Data/CatFiles/F1_R1.fastq.gz ../Data/CatFiles/F1_R2.fastq.gz
[2021-03-29 12:11:37,260]    INFO Using index ../Data/transcriptome.idx to generate BUS file to ../Data/kallisto_align_F1 from
[2021-03-29 12:11:37,260]    INFO         ../Data/CatFiles/F1_R1.fastq.gz
[2021-03-29 12:11:37,260]    INFO         ../Data/CatFiles/F1_R2.fastq.gz
[2021-03-29 12:16:42,491]    INFO Sorting BUS file ../Data/kallisto_align_F1/output.bus to ../Data/kallisto_align_F1/tmp/output.s.bus
[2021-03-29 12:17:40,312]    INFO Whitelist not provided
[2021-03-29 12:17:40,312]    INFO Copying pre-packaged 10XV3 whitelist to ../Data/kallisto_align_F1
[2021-03-29 12:17:40,904]    INFO Inspecting BUS file ../Data/kallisto_align_F1/tmp/output.s.bus
[2021-03-29 12:18:17,413]    INFO Correcting BUS records in ../Data/kallisto_align_F1/tmp/output.s.bus to ../Data/kallisto_align_F1/tmp/output.s.c.bus with whitelist ../Data/kallisto_align_F1/10xv3_whitelist.txt
[2021-03-29 12:18:50,875]    INFO Sorting BUS file ../Data/kallisto_align_F1/tmp/output.s.c.bus to ../Data/kallisto_align_F1/output.unfiltered.bus
[2021-03-29 12:19:27,397]    INFO Generating count matrix ../Data/kallisto_align_F1/counts_unfiltered/cells_x_genes from BUS file ../Data/kallisto_align_F1/output.unfiltered.bus
[2021-03-29 12:20:20,001]    INFO Reading matrix ../Data/kallisto_align_F1/counts_unfiltered/cells_x_genes.mtx
[2021-03-29 12:20:47,966]    INFO Writing matrix to h5ad ../Data/kallisto_align_F1/counts_unfiltered/adata.h5ad
/Users/emmanueldollinger/miniconda3/lib/python3.7/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
... storing 'gene_name' as categorical
[2021-03-29 12:20:49,227]    INFO Filtering with bustools
[2021-03-29 12:20:49,227]    INFO Generating whitelist ../Data/kallisto_align_F1/filter_barcodes.txt from BUS file ../Data/kallisto_align_F1/output.unfiltered.bus
[2021-03-29 12:20:50,233]    INFO Correcting BUS records in ../Data/kallisto_align_F1/output.unfiltered.bus to ../Data/kallisto_align_F1/tmp/output.unfiltered.c.bus with whitelist ../Data/kallisto_align_F1/filter_barcodes.txt
[2021-03-29 12:21:14,047]    INFO Sorting BUS file ../Data/kallisto_align_F1/tmp/output.unfiltered.c.bus to ../Data/kallisto_align_F1/output.filtered.bus
[2021-03-29 12:21:43,460]    INFO Generating count matrix ../Data/kallisto_align_F1/counts_filtered/cells_x_genes from BUS file ../Data/kallisto_align_F1/output.filtered.bus
[2021-03-29 12:22:15,664]    INFO Reading matrix ../Data/kallisto_align_F1/counts_filtered/cells_x_genes.mtx
[2021-03-29 12:22:31,402]    INFO Writing matrix to h5ad ../Data/kallisto_align_F1/counts_filtered/adata.h5ad
... storing 'gene_name' as categorical
!kb count --h5ad -i ../Data/transcriptome.idx \
-g ../Data/transcripts_to_genes.txt -x 10XV3 \
-o ../Data/kallisto_align_F2 --filter bustools --overwrite -t 10 ../Data/CatFiles/F2_R1.fastq.gz ../Data/CatFiles/F2_R2.fastq.gz
[2021-03-29 12:22:33,429]    INFO Using index ../Data/transcriptome.idx to generate BUS file to ../Data/kallisto_align_F2 from
[2021-03-29 12:22:33,429]    INFO         ../Data/CatFiles/F2_R1.fastq.gz
[2021-03-29 12:22:33,429]    INFO         ../Data/CatFiles/F2_R2.fastq.gz
[2021-03-29 12:27:43,257]    INFO Sorting BUS file ../Data/kallisto_align_F2/output.bus to ../Data/kallisto_align_F2/tmp/output.s.bus
[2021-03-29 12:28:30,055]    INFO Whitelist not provided
[2021-03-29 12:28:30,055]    INFO Copying pre-packaged 10XV3 whitelist to ../Data/kallisto_align_F2
[2021-03-29 12:28:30,620]    INFO Inspecting BUS file ../Data/kallisto_align_F2/tmp/output.s.bus
[2021-03-29 12:28:59,069]    INFO Correcting BUS records in ../Data/kallisto_align_F2/tmp/output.s.bus to ../Data/kallisto_align_F2/tmp/output.s.c.bus with whitelist ../Data/kallisto_align_F2/10xv3_whitelist.txt
[2021-03-29 12:29:27,351]    INFO Sorting BUS file ../Data/kallisto_align_F2/tmp/output.s.c.bus to ../Data/kallisto_align_F2/output.unfiltered.bus
[2021-03-29 12:29:55,576]    INFO Generating count matrix ../Data/kallisto_align_F2/counts_unfiltered/cells_x_genes from BUS file ../Data/kallisto_align_F2/output.unfiltered.bus
[2021-03-29 12:30:33,051]    INFO Reading matrix ../Data/kallisto_align_F2/counts_unfiltered/cells_x_genes.mtx
[2021-03-29 12:30:53,474]    INFO Writing matrix to h5ad ../Data/kallisto_align_F2/counts_unfiltered/adata.h5ad
/Users/emmanueldollinger/miniconda3/lib/python3.7/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
... storing 'gene_name' as categorical
[2021-03-29 12:30:54,842]    INFO Filtering with bustools
[2021-03-29 12:30:54,842]    INFO Generating whitelist ../Data/kallisto_align_F2/filter_barcodes.txt from BUS file ../Data/kallisto_align_F2/output.unfiltered.bus
[2021-03-29 12:30:55,564]    INFO Correcting BUS records in ../Data/kallisto_align_F2/output.unfiltered.bus to ../Data/kallisto_align_F2/tmp/output.unfiltered.c.bus with whitelist ../Data/kallisto_align_F2/filter_barcodes.txt
[2021-03-29 12:31:11,453]    INFO Sorting BUS file ../Data/kallisto_align_F2/tmp/output.unfiltered.c.bus to ../Data/kallisto_align_F2/output.filtered.bus
[2021-03-29 12:31:30,731]    INFO Generating count matrix ../Data/kallisto_align_F2/counts_filtered/cells_x_genes from BUS file ../Data/kallisto_align_F2/output.filtered.bus
[2021-03-29 12:31:48,377]    INFO Reading matrix ../Data/kallisto_align_F2/counts_filtered/cells_x_genes.mtx
[2021-03-29 12:31:56,373]    INFO Writing matrix to h5ad ../Data/kallisto_align_F2/counts_filtered/adata.h5ad
... storing 'gene_name' as categorical

Part 3: Comparison

Load all of the data.

import chunk

import os
import pandas as pd
import scanpy as sc
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
F1_cellranger =  sc.read_10x_h5('../Data/raw_feature_bc_matrixF1.h5')
F1_cellranger.var_names_make_unique()

F2_cellranger =  sc.read_10x_h5('../Data/raw_feature_bc_matrixF2.h5')
F2_cellranger.var_names_make_unique()
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
F1_kallisto = sc.read_h5ad('../Data/kallisto_align_F1/counts_unfiltered/adata.h5ad')
F1_kallisto.var_names_make_unique()

F2_kallisto = sc.read_h5ad('../Data/kallisto_align_F2/counts_unfiltered/adata.h5ad')
F2_kallisto.var_names_make_unique()

Clean up kallisto gene names.

F1_kallisto.var['gene_name'] = F1_kallisto.var['gene_name'].astype(str)
F2_kallisto.var['gene_name'] = F2_kallisto.var['gene_name'].astype(str)
F1_kallisto.var.set_index('gene_name', inplace = True)
F2_kallisto.var.set_index('gene_name', inplace = True)
F1_kallisto.var_names_make_unique()
F2_kallisto.var_names_make_unique()

First, let’s filter genes per cell and cells per gene:

F1_cellranger
AnnData object with n_obs × n_vars = 6794880 × 53700
    var: 'gene_ids', 'feature_types', 'genome'
ListOfAdata = [F1_cellranger, F1_kallisto, F2_cellranger, F2_kallisto]

for adata in ListOfAdata:

    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)
    sc.pp.calculate_qc_metrics(adata, percent_top=None, log1p=False, inplace=True)
F1_cellranger
AnnData object with n_obs × n_vars = 7964 × 21150
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
F1_kallisto
AnnData object with n_obs × n_vars = 7977 × 27046
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts'
    var: 'n_cells', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
F2_cellranger
AnnData object with n_obs × n_vars = 5515 × 20220
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
F2_kallisto
AnnData object with n_obs × n_vars = 5470 × 25936
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts'
    var: 'n_cells', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'

Save anndatas.

F1_cellranger.write_h5ad('../Data/F1_cellranger.h5ad')
F2_cellranger.write_h5ad('../Data/F2_cellranger.h5ad')
F1_kallisto.write_h5ad('../Data/F1_kallisto.h5ad')
F2_kallisto.write_h5ad('../Data/F2_kallisto.h5ad')
... storing 'feature_types' as categorical
... storing 'genome' as categorical
... storing 'feature_types' as categorical
... storing 'genome' as categorical

Plot metrics of number of cells and number of counts

dfMetrics = pd.DataFrame()

dfMetrics["Batch"] = np.array([np.repeat(x, 2) for x in np.unique(obs["Batch"])]).flatten()

dfMetrics["Metric"] = np.tile(['NumberOfCells','NumberOfCounts'], 4)

dfMetrics["Number"] = np.array([7960,21150,7977,27046,5512,20220,5470,25936])
plt.figure(figsize=(10, 10))

sns.set(font_scale=2)

fig = sns.barplot(data = dfMetrics, x = 'Metric', y='Number', hue = 'Batch', palette = ['#ad2828', '#13743c','#fa3a3a','#60c000'], hue_order = ['F1_cellranger','F1_kallisto','F2_cellranger','F2_kallisto'])

# plt.xscale("log")
# plt.xlabel("Total counts of transcriptic counts per cell.")

plt.savefig('/Users/emmanueldollinger/Desktop/Metrics.pdf', dpi=300, transparent = False)

png

Plot transcripts/cell and genes/cell

F1_cellranger.obs['Mouse'] = 'F1'
F1_cellranger.obs['Aligner'] = 'cellranger'

F2_cellranger.obs['Mouse'] = 'F2'
F2_cellranger.obs['Aligner'] = 'cellranger'

F1_kallisto.obs['Mouse'] = 'F1'
F1_kallisto.obs['Aligner'] = 'kallisto'

F2_kallisto.obs['Mouse'] = 'F2'
F2_kallisto.obs['Aligner'] = 'kallisto'
obs = F1_cellranger.obs
obs = obs.append(F2_cellranger.obs)
obs = obs.append(F1_kallisto.obs)
obs = obs.append(F2_kallisto.obs)
obs
n_genes n_genes_by_counts total_counts Mouse Aligner
AAACCCAAGATGGGCT-1 250 250 444.0 F1 cellranger
AAACCCAAGCTCGTGC-1 2244 2244 6383.0 F1 cellranger
AAACCCACAATCTAGC-1 4334 4330 21088.0 F1 cellranger
AAACCCACAGTTGAAA-1 227 227 568.0 F1 cellranger
AAACCCACATGGGAAC-1 757 757 1291.0 F1 cellranger
... ... ... ... ... ...
TTTGTTGCAAGAAACT 414 414 1021.0 F2 kallisto
TTTGTTGCATGCCGGT 342 342 413.0 F2 kallisto
TTTGTTGGTCAAAGAT 2259 2259 5094.0 F2 kallisto
TTTGTTGGTGTTTACG 5749 5743 23113.0 F2 kallisto
TTTGTTGTCACTTTGT 1591 1588 3226.0 F2 kallisto

26926 rows × 5 columns

obs["Batch"] = obs["Mouse"] + '_' + obs["Aligner"]
plt.figure(figsize=(10, 10))
sns.set(font_scale=2)

fig = sns.ecdfplot(data = obs, x = 'total_counts', hue = 'Batch', palette = ['#ad2828', '#13743c','#fa3a3a','#60c000'], hue_order = ['F1_cellranger','F1_kallisto','F2_cellranger','F2_kallisto'], linewidth = 3)

plt.xscale("log")
plt.xlabel("Counts per cell.")

plt.savefig('/Users/emmanueldollinger/Desktop/CountsPerCell.pdf', dpi=300, transparent = False)

png

plt.figure(figsize=(10, 10))
sns.set(font_scale=2)

fig = sns.ecdfplot(data = obs, x = 'n_genes', hue = 'Batch', palette = ['#ad2828', '#13743c','#fa3a3a','#60c000'], hue_order = ['F1_cellranger','F1_kallisto','F2_cellranger','F2_kallisto'], linewidth = 3)

plt.xscale("log")
plt.xlabel("Genes per cell.")

plt.savefig('/Users/emmanueldollinger/Desktop/GenesPerCell.pdf', dpi=300, transparent = False)

png

040121

Now, let’s ask the question if we look at the union of the genes, how do the clusters change? No batch correction.

F1_cellranger.obs["Batch"] = F1_cellranger.obs["Mouse"].values + '_' + F1_cellranger.obs["Aligner"].values

F1_kallisto.obs["Batch"] = F1_kallisto.obs["Mouse"].values + '_' + F1_kallisto.obs["Aligner"].values

F2_cellranger.obs["Batch"] = F2_cellranger.obs["Mouse"].values + '_' + F2_cellranger.obs["Aligner"].values

F2_kallisto.obs["Batch"] = F2_kallisto.obs["Mouse"].values + '_' + F2_kallisto.obs["Aligner"].values
del AllCells
AllCells = F1_cellranger.concatenate(F1_kallisto, F2_cellranger, F2_kallisto)
AllCells
AnnData object with n_obs × n_vars = 26926 × 19093
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'Mouse', 'Aligner', 'Batch', 'batch'
    var: 'gene_ids-0', 'feature_types-0', 'genome-0', 'n_cells-0', 'n_cells_by_counts-0', 'mean_counts-0', 'pct_dropout_by_counts-0', 'total_counts-0', 'n_cells-1', 'n_cells_by_counts-1', 'mean_counts-1', 'pct_dropout_by_counts-1', 'total_counts-1', 'gene_ids-2', 'feature_types-2', 'genome-2', 'n_cells-2', 'n_cells_by_counts-2', 'mean_counts-2', 'pct_dropout_by_counts-2', 'total_counts-2', 'n_cells-3', 'n_cells_by_counts-3', 'mean_counts-3', 'pct_dropout_by_counts-3', 'total_counts-3'

As opposed to the scanpy docs, I store the counts in raw (not the log1p transformed counts).

AllCells.raw = AllCells

Filter mt cells.

AllCells.var['mt'] = AllCells.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(AllCells, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
AllCells = AllCells[AllCells.obs.pct_counts_mt < 5, :]

Typical preprocessing.

AllCells
View of AnnData object with n_obs × n_vars = 26926 × 19093
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'Mouse', 'Aligner', 'Batch', 'batch', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids-0', 'feature_types-0', 'genome-0', 'n_cells-0', 'n_cells_by_counts-0', 'mean_counts-0', 'pct_dropout_by_counts-0', 'total_counts-0', 'n_cells-1', 'n_cells_by_counts-1', 'mean_counts-1', 'pct_dropout_by_counts-1', 'total_counts-1', 'gene_ids-2', 'feature_types-2', 'genome-2', 'n_cells-2', 'n_cells_by_counts-2', 'mean_counts-2', 'pct_dropout_by_counts-2', 'total_counts-2', 'n_cells-3', 'n_cells_by_counts-3', 'mean_counts-3', 'pct_dropout_by_counts-3', 'total_counts-3', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
sc.pp.normalize_total(AllCells, target_sum=1e4)
sc.pp.log1p(AllCells)

sc.pp.highly_variable_genes(AllCells, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pp.scale(AllCells, max_value=10)
/Users/emmanueldollinger/miniconda3/lib/python3.7/site-packages/scanpy/preprocessing/_normalization.py:138: UserWarning: Revieved a view of an AnnData. Making a copy.
  view_to_actual(adata)
sc.tl.pca(AllCells, svd_solver='arpack')
sc.pp.neighbors(AllCells, n_neighbors=10, n_pcs=40, metric='cosine')
sc.tl.umap(AllCells)
sc.settings.set_figure_params(dpi=80, facecolor='white', figsize=(10,10))
fig = sc.pl.umap(AllCells, color=['Batch'], palette = ['#940505', '#13743c','#fb8d8d','#60c000'], size = 10, save = 'UMAPbyBatch.pdf') #ad2828
WARNING: saving figure to file figures/umapUMAPbyBatch.pdf

png

This is probably due to # of transcripts, AnnData concat does the intersection of genes.

?ad.AnnData.concatenate
Signature:
ad.AnnData.concatenate(
    self,
    *adatas: 'AnnData',
    join: str = 'inner',
    batch_key: str = 'batch',
    batch_categories: Sequence[Any] = None,
    uns_merge: Union[str, NoneType] = None,
    index_unique: Union[str, NoneType] = '-',
    fill_value=None,
) -> 'AnnData'
Docstring:
Concatenate along the observations axis.

The :attr:`uns`, :attr:`varm` and :attr:`obsm` attributes are ignored.

Currently, this works only in `'memory'` mode.

Parameters
----------
adatas
    AnnData matrices to concatenate with. Each matrix is referred to as
    a “batch”.
join
    Use intersection (`'inner'`) or union (`'outer'`) of variables.
batch_key
    Add the batch annotation to :attr:`obs` using this key.
batch_categories
    Use these as categories for the batch annotation. By default, use increasing numbers.
uns_merge
    Strategy to use for merging entries of uns. These strategies are applied recusivley.
    Currently implemented strategies include:

    * `None`: The default. The concatenated object will just have an empty dict for `uns`.
    * `"same"`: Only entries which have the same value in all AnnData objects are kept.
    * `"unique"`: Only entries which have one unique value in all AnnData objects are kept.
    * `"first"`: The first non-missing value is used.
    * `"only"`: A value is included if only one of the AnnData objects has a value at this
      path.
index_unique
    Make the index unique by joining the existing index names with the
    batch category, using `index_unique='-'`, for instance. Provide
    `None` to keep existing indices.
fill_value
    Scalar value to fill newly missing values in arrays with. Note: only applies to arrays
    and sparse matrices (not dataframes) and will only be used if `join="outer"`.

    .. note::
        If not provided, the default value is `0` for sparse matrices and `np.nan`
        for numpy arrays. See the examples below for more information.

Returns
-------
:class:`~anndata.AnnData`
    The concatenated :class:`~anndata.AnnData`, where `adata.obs[batch_key]`
    stores a categorical variable labeling the batch.

Notes
-----

.. warning::

   If you use `join='outer'` this fills 0s for sparse data when
   variables are absent in a batch. Use this with care. Dense data is
   filled with `NaN`. See the examples.

Examples
--------
Joining on intersection of variables.

>>> adata1 = AnnData(
...     np.array([[1, 2, 3], [4, 5, 6]]),
...     dict(obs_names=['s1', 's2'], anno1=['c1', 'c2']),
...     dict(var_names=['a', 'b', 'c'], annoA=[0, 1, 2]),
... )
>>> adata2 = AnnData(
...     np.array([[1, 2, 3], [4, 5, 6]]),
...     dict(obs_names=['s3', 's4'], anno1=['c3', 'c4']),
...     dict(var_names=['d', 'c', 'b'], annoA=[0, 1, 2]),
... )
>>> adata3 = AnnData(
... np.array([[1, 2, 3], [4, 5, 6]]),
...     dict(obs_names=['s1', 's2'], anno2=['d3', 'd4']),
...     dict(var_names=['d', 'c', 'b'], annoA=[0, 2, 3], annoB=[0, 1, 2]),
... )
>>> adata = adata1.concatenate(adata2, adata3)
>>> adata
AnnData object with n_obs × n_vars = 6 × 2
    obs: 'anno1', 'anno2', 'batch'
    var: 'annoA-0', 'annoA-1', 'annoA-2', 'annoB-2'
>>> adata.X
array([[2., 3.],
       [5., 6.],
       [3., 2.],
       [6., 5.],
       [3., 2.],
       [6., 5.]], dtype=float32)
>>> adata.obs
     anno1 anno2 batch
s1-0    c1   NaN     0
s2-0    c2   NaN     0
s3-1    c3   NaN     1
s4-1    c4   NaN     1
s1-2   NaN    d3     2
s2-2   NaN    d4     2
>>> adata.var.T
         b  c
annoA-0  1  2
annoA-1  2  1
annoA-2  3  2
annoB-2  2  1

Joining on the union of variables.

>>> outer = adata1.concatenate(adata2, adata3, join='outer')
>>> outer
AnnData object with n_obs × n_vars = 6 × 4
    obs: 'anno1', 'anno2', 'batch'
    var: 'annoA-0', 'annoA-1', 'annoA-2', 'annoB-2'
>>> outer.var.T
           a    b    c    d
annoA-0  0.0  1.0  2.0  NaN
annoA-1  NaN  2.0  1.0  0.0
annoA-2  NaN  3.0  2.0  0.0
annoB-2  NaN  2.0  1.0  0.0
>>> outer.var_names
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> outer.X
array([[ 1.,  2.,  3., nan],
       [ 4.,  5.,  6., nan],
       [nan,  3.,  2.,  1.],
       [nan,  6.,  5.,  4.],
       [nan,  3.,  2.,  1.],
       [nan,  6.,  5.,  4.]], dtype=float32)
>>> outer.X.sum(axis=0)
array([nan, 25., 23., nan], dtype=float32)
>>> import pandas as pd
>>> Xdf = pd.DataFrame(outer.X, columns=outer.var_names)
>>> Xdf
     a    b    c    d
0  1.0  2.0  3.0  NaN
1  4.0  5.0  6.0  NaN
2  NaN  3.0  2.0  1.0
3  NaN  6.0  5.0  4.0
4  NaN  3.0  2.0  1.0
5  NaN  6.0  5.0  4.0
>>> Xdf.sum()
a     5.0
b    25.0
c    23.0
d    10.0
dtype: float32

One way to deal with missing values is to use masked arrays:

>>> from numpy import ma
>>> outer.X = ma.masked_invalid(outer.X)
>>> outer.X
masked_array(
  data=[[1.0, 2.0, 3.0, --],
        [4.0, 5.0, 6.0, --],
        [--, 3.0, 2.0, 1.0],
        [--, 6.0, 5.0, 4.0],
        [--, 3.0, 2.0, 1.0],
        [--, 6.0, 5.0, 4.0]],
  mask=[[False, False, False,  True],
        [False, False, False,  True],
        [ True, False, False, False],
        [ True, False, False, False],
        [ True, False, False, False],
        [ True, False, False, False]],
  fill_value=1e+20,
  dtype=float32)
>>> outer.X.sum(axis=0).data
array([ 5., 25., 23., 10.], dtype=float32)

The masked array is not saved but has to be reinstantiated after saving.

>>> outer.write('./test.h5ad')
>>> from anndata import read_h5ad
>>> outer = read_h5ad('./test.h5ad')
>>> outer.X
array([[ 1.,  2.,  3., nan],
       [ 4.,  5.,  6., nan],
       [nan,  3.,  2.,  1.],
       [nan,  6.,  5.,  4.],
       [nan,  3.,  2.,  1.],
       [nan,  6.,  5.,  4.]], dtype=float32)

For sparse data, everything behaves similarly,
except that for `join='outer'`, zeros are added.

>>> from scipy.sparse import csr_matrix
>>> adata1 = AnnData(
...     csr_matrix([[0, 2, 3], [0, 5, 6]]),
...     dict(obs_names=['s1', 's2'], anno1=['c1', 'c2']),
...     dict(var_names=['a', 'b', 'c']),
... )
>>> adata2 = AnnData(
... csr_matrix([[0, 2, 3], [0, 5, 6]]),
...     dict(obs_names=['s3', 's4'], anno1=['c3', 'c4']),
...     dict(var_names=['d', 'c', 'b']),
... )
>>> adata3 = AnnData(
... csr_matrix([[1, 2, 0], [0, 5, 6]]),
...     dict(obs_names=['s5', 's6'], anno2=['d3', 'd4']),
...     dict(var_names=['d', 'c', 'b']),
... )
>>> adata = adata1.concatenate(adata2, adata3, join='outer')
>>> adata.var_names
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> adata.X.toarray()
array([[0., 2., 3., 0.],
       [0., 5., 6., 0.],
       [0., 3., 2., 0.],
       [0., 6., 5., 0.],
       [0., 0., 2., 1.],
       [0., 6., 5., 0.]], dtype=float32)
File:      ~/miniconda3/lib/python3.7/site-packages/anndata/_core/anndata.py
Type:      function
sc.settings.set_figure_params(dpi=80, facecolor='white', figsize=(10,10))
fig = sc.pl.pca(AllCells, color=['Batch'], palette = ['#940505', '#13743c','#fb8d8d','#60c000'], size = 10, save = 'PCAbyBatch.pdf')
WARNING: saving figure to file figures/pcaPCAbyBatch.pdf

png

sc.settings.set_figure_params(dpi=80, facecolor='white', figsize=(10,10))
fig = sc.pl.pca(AllCells, color=['Batch'], palette = ['#940505', '#13743c','#fb8d8d','#60c000'], size = 10, save = 'PCAbyBatch34.pdf', components = ['3,4'])
WARNING: saving figure to file figures/pcaPCAbyBatch34.pdf

png

Cell annotation:

AllCells.obs['total_counts_log'] = np.log10(AllCells.obs['total_counts'])
AllCells.obs['n_genes_log'] = np.log10(AllCells.obs['n_genes'])
sc.pl.umap(AllCells, color=['Ptprc', 'Krt14', 'Krt5', 'Krt17', 'Enpp2', 'total_counts_log', 'n_genes_log'], size = 10, use_raw=False, ncols =2)

png

sc.pl.umap(AllCells, color=['total_counts_log', 'Batch'], size = 10, use_raw=False, ncols =2, save='TotalCountsAndNGenes.pdf')
WARNING: saving figure to file figures/umapTotalCountsAndNGenes.pdf

png

sc.pl.umap(AllCells, color=['n_genes_log', 'Batch'], size = 10, use_raw=False, ncols =2, save='NGenes.pdf')
WARNING: saving figure to file figures/umapNGenes.pdf

png

F1_cellranger = AllCells[AllCells.obs["Batch"] == "F1_cellranger",:]
F2_cellranger = AllCells[AllCells.obs["Batch"] == "F2_cellranger",:]
sc.pl.umap(F1_cellranger, color=['total_counts_log', 'n_genes_log'], size = 10, use_raw=False, ncols =2)
sc.pl.umap(F2_cellranger, color=['total_counts_log', 'n_genes_log'], size = 10, use_raw=False, ncols =2)

png

png

AllCells_totalcountsSubset = AllCells[AllCells.obs["total_counts_log"] < 3,:]
sc.pl.umap(AllCells_totalcountsSubset, color=['total_counts_log', 'n_genes_log'], size = 10, use_raw=False, ncols =2)

png

Subset just ICs:

ICs = AllCells[AllCells[:,"Ptprc"].X > 0,:]
ICs_raw = ICs.raw.to_adata()
sc.pp.normalize_total(ICs_raw, target_sum=1e4)
sc.pp.log1p(ICs_raw)

sc.pp.highly_variable_genes(ICs_raw, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pp.scale(ICs_raw, max_value=10)
WARNING: adata.X seems to be already log-transformed.
sc.tl.pca(ICs_raw, svd_solver='arpack')
sc.pp.neighbors(ICs_raw, n_neighbors=10, n_pcs=40, metric='cosine')
sc.tl.umap(ICs_raw)
sc.settings.set_figure_params(dpi=80, facecolor='white', figsize=(10,10))
sc.pl.umap(ICs_raw, color=['Batch'], palette = ['#940505', '#13743c','#fb8d8d','#60c000'], size = 20, save = 'UMAPbyBatch_ICs.pdf')
WARNING: saving figure to file figures/umapUMAPbyBatch_ICs.pdf

png

sc.pl.umap(ICs_raw, color=['Batch', 'Cd3e'], palette = ['#940505', '#13743c','#fb8d8d','#60c000'], size = 20, save = 'UMAPbyBatch_ICs.pdf')
WARNING: saving figure to file figures/umapUMAPbyBatch_ICs.pdf

png

Krts = AllCells[AllCells[:,"Krt14"].X > 0, :]
ICs = AllCells[AllCells[:,"Ptprc"].X > 0, :]
FBs = AllCells[AllCells[:,"Enpp2"].X > 0, :]
Krt14 = np.array(Krts[:,"Krt14"].X).flatten()
Ptprc = np.array(ICs[:,"Ptprc"].X).flatten()
Enpp2 = np.array(FBs[:,"Enpp2"].X).flatten()
Krtsdf = pd.DataFrame()
Krtsdf["Krt14"] = Krt14
Krtsdf["Batch"] = Krts.obs["Batch"].values
ICsdf = pd.DataFrame()
ICsdf["Ptprc"] = Ptprc
ICsdf["Batch"] = ICs.obs["Batch"].values
FBsdf = pd.DataFrame()
FBsdf["Enpp2"] = Enpp2
FBsdf["Batch"] = FBs.obs["Batch"].values
plt.figure(figsize=(10,5))

plt.subplot(1,3,1)

vlnKrts = sns.violinplot(data = Krtsdf, x = 'Batch', y = 'Krt14', palette = ['#940505', '#13743c','#fb8d8d','#60c000'])
vlnKrts.set_xticklabels(labels = np.unique(Krtsdf['Batch']), rotation = 45, ha= 'right')

plt.subplot(1,3,2)

vlnICs = sns.violinplot(data = ICsdf, x = 'Batch', y = 'Ptprc', palette = ['#940505', '#13743c','#fb8d8d','#60c000'])
vlnICs.set_xticklabels(labels = np.unique(ICsdf['Batch']), rotation = 45, ha= 'right')

plt.subplot(1,3,3)

vlnFBs = sns.violinplot(data = FBsdf, x = 'Batch', y = 'Enpp2', palette = ['#940505', '#13743c','#fb8d8d','#60c000'])
vlnFBs.set_xticklabels(labels = np.unique(FBsdf['Batch']), rotation = 45, ha= 'right')

plt.savefig('/Users/emmanueldollinger/Desktop/DiffGenes.pdf', dpi=300, transparent = False)

png

Let’s see if I can batch correct out the aligners. Using harmony:

import scanpy.external as sce
AllCells
AnnData object with n_obs × n_vars = 26926 × 19093
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'Mouse', 'Aligner', 'Batch', 'batch', 'total_counts_mt', 'pct_counts_mt', 'total_counts_log', 'n_genes_log'
    var: 'gene_ids-0', 'feature_types-0', 'genome-0', 'n_cells-0', 'n_cells_by_counts-0', 'mean_counts-0', 'pct_dropout_by_counts-0', 'total_counts-0', 'n_cells-1', 'n_cells_by_counts-1', 'mean_counts-1', 'pct_dropout_by_counts-1', 'total_counts-1', 'gene_ids-2', 'feature_types-2', 'genome-2', 'n_cells-2', 'n_cells_by_counts-2', 'mean_counts-2', 'pct_dropout_by_counts-2', 'total_counts-2', 'n_cells-3', 'n_cells_by_counts-3', 'mean_counts-3', 'pct_dropout_by_counts-3', 'total_counts-3', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'Batch_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
sce.pp.harmony_integrate(AllCells, key='Batch')
2021-04-04 14:42:52,989 - harmonypy - INFO - Iteration 1 of 10
2021-04-04 14:42:59,047 - harmonypy - INFO - Iteration 2 of 10
2021-04-04 14:43:05,159 - harmonypy - INFO - Iteration 3 of 10
2021-04-04 14:43:11,282 - harmonypy - INFO - Iteration 4 of 10
2021-04-04 14:43:13,433 - harmonypy - INFO - Iteration 5 of 10
2021-04-04 14:43:17,269 - harmonypy - INFO - Iteration 6 of 10
2021-04-04 14:43:19,164 - harmonypy - INFO - Iteration 7 of 10
2021-04-04 14:43:21,061 - harmonypy - INFO - Converged after 7 iterations
AllCells
AnnData object with n_obs × n_vars = 26926 × 19093
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'Mouse', 'Aligner', 'Batch', 'batch', 'total_counts_mt', 'pct_counts_mt', 'total_counts_log', 'n_genes_log'
    var: 'gene_ids-0', 'feature_types-0', 'genome-0', 'n_cells-0', 'n_cells_by_counts-0', 'mean_counts-0', 'pct_dropout_by_counts-0', 'total_counts-0', 'n_cells-1', 'n_cells_by_counts-1', 'mean_counts-1', 'pct_dropout_by_counts-1', 'total_counts-1', 'gene_ids-2', 'feature_types-2', 'genome-2', 'n_cells-2', 'n_cells_by_counts-2', 'mean_counts-2', 'pct_dropout_by_counts-2', 'total_counts-2', 'n_cells-3', 'n_cells_by_counts-3', 'mean_counts-3', 'pct_dropout_by_counts-3', 'total_counts-3', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'Batch_colors'
    obsm: 'X_pca', 'X_umap', 'X_pca_harmony'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
sc.pp.neighbors(AllCells, n_neighbors=10, n_pcs=40, metric='cosine', use_rep='X_pca_harmony', key_added='neighbors_Harmony')
AllCells.obsm['X_umap_Original'] = AllCells.obsm['X_umap'].copy()
AllCells
AnnData object with n_obs × n_vars = 26926 × 19093
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'Mouse', 'Aligner', 'Batch', 'batch', 'total_counts_mt', 'pct_counts_mt', 'total_counts_log', 'n_genes_log'
    var: 'gene_ids-0', 'feature_types-0', 'genome-0', 'n_cells-0', 'n_cells_by_counts-0', 'mean_counts-0', 'pct_dropout_by_counts-0', 'total_counts-0', 'n_cells-1', 'n_cells_by_counts-1', 'mean_counts-1', 'pct_dropout_by_counts-1', 'total_counts-1', 'gene_ids-2', 'feature_types-2', 'genome-2', 'n_cells-2', 'n_cells_by_counts-2', 'mean_counts-2', 'pct_dropout_by_counts-2', 'total_counts-2', 'n_cells-3', 'n_cells_by_counts-3', 'mean_counts-3', 'pct_dropout_by_counts-3', 'total_counts-3', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'Batch_colors'
    obsm: 'X_pca', 'X_umap', 'X_pca_harmony', 'X_umap_Original'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
sc.tl.umap(AllCells, neighbors_key='neighbors_Harmony')
sc.settings.set_figure_params(dpi=80, facecolor='white', figsize=(10,10))
fig = sc.pl.umap(AllCells, color=['Batch'], palette = ['#940505', '#13743c','#fb8d8d','#60c000'], size = 10, save = 'UMAPbyBatch.pdf', ncols=2, use_raw = False)
WARNING: saving figure to file figures/umapUMAPbyBatch.pdf

png

sc.settings.set_figure_params(dpi=80, facecolor='white', figsize=(10,10))
fig = sc.pl.umap(AllCells, color=['Krt14','Ptprc','Enpp2'], palette = ['#940505', '#13743c','#fb8d8d','#60c000'], size = 10, save = 'UMAPgenes.pdf', ncols=2, use_raw = False) #ad2828
WARNING: saving figure to file figures/umapUMAPgenes.pdf

png

Calculate the difference between aligners in medians of each gene.

%%time

Difference_k_STAR = np.empty(len(AllCells.var.index.values)*2)

i = 0

GeneList = AllCells.var.index.values

MouseList = AllCells.obs['Mouse'].values

AlignerList = AllCells.obs['Aligner'].values

matrix = AllCells.X

for gene in GeneList:

    for mouse in ['F1','F2']:

        MouseIndex = MouseList == mouse

        subset = matrix[MouseIndex, GeneList == gene]

        temp_diff = np.median(subset[AlignerList[MouseIndex] == 'kallisto']) - np.median(subset[AlignerList[MouseIndex] == 'cellranger'])

        Difference_k_STAR[i] = temp_diff

        i += 1
CPU times: user 34.7 s, sys: 23 ms, total: 34.7 s
Wall time: 34.7 s
df_genes_diff = pd.DataFrame()
df_genes_diff['Mouse'] = np.tile(['F1','F2'],len(GeneList))
df_genes_diff['Difference_k_STAR'] = Difference_k_STAR
df_genes_diff['Genes'] = np.repeat(GeneList,2)
df_genes_diff
Mouse Difference_k_STAR Genes
0 F1 0.0 Sox17
1 F2 0.0 Sox17
2 F1 0.0 Gm6085
3 F2 0.0 Gm6085
4 F1 0.0 Mrpl15
... ... ... ...
38181 F2 0.0 mt-Tp
38182 F1 0.0 AC149090.1
38183 F2 0.0 AC149090.1
38184 F1 0.0 CAAA01147332.1
38185 F2 0.0 CAAA01147332.1

38186 rows × 3 columns

subsetdf = df_genes_diff[np.abs(df_genes_diff['Difference_k_STAR']) > 0.001]
subsetdf
Mouse Difference_k_STAR Genes
98 F1 -0.252593 Rpl7
99 F2 -0.434404 Rpl7
308 F1 -1.308775 Rpl31
324 F1 0.644899 Map4k4
386 F1 0.041473 Col3a1
... ... ... ...
38173 F2 0.066550 mt-Nd5
38174 F1 1.232942 mt-Nd6
38175 F2 1.146825 mt-Nd6
38176 F1 0.047487 mt-Cytb
38177 F2 0.053453 mt-Cytb

601 rows × 3 columns

len(subsetdf[subsetdf['Mouse'] == 'F1'])
403
len(subsetdf[subsetdf['Mouse'] != 'F1'])
198
sns.boxplot(data=subsetdf, y = 'Difference_k_STAR', x = 'Mouse', fliersize=0)
sns.stripplot(data=subsetdf, y = 'Difference_k_STAR', x = 'Mouse')
plt.savefig('/Users/emmanueldollinger/Desktop/DistOfGenes.pdf', dpi=300, transparent = False)

png

subsetdf['diffAbs'] = np.abs(subsetdf['Difference_k_STAR'])
/Users/emmanueldollinger/miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
subsetdf.sort_values('diffAbs', ascending=False,inplace=True)
/Users/emmanueldollinger/miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
subsetdf
Mouse Difference_k_STAR Genes diffAbs
37189 F2 -2.179982 Fau 2.179982
21934 F1 -2.166188 Rpl13 2.166188
15999 F2 -2.156097 Rps16 2.156097
21935 F2 -2.145788 Rpl13 2.145788
25903 F2 -2.101104 Rps27a 2.101104
... ... ... ... ...
34602 F1 0.001784 Dusp1 0.001784
8312 F1 -0.001540 Selenof 0.001540
12273 F2 -0.001498 Dynll1 0.001498
35116 F1 0.001307 H2-D1 0.001307
23101 F2 -0.001021 Itm2b 0.001021

601 rows × 4 columns

subsetdf['Genes'][0:75].values
array(['Fau', 'Rpl13', 'Rps16', 'Rpl13', 'Rps27a', 'Rps27a', 'Rpl28',
       'Rpl9', 'Rpl9', 'Rps12', 'Rpl11', 'Rps23', 'Rpl11', 'Rpl3', 'Rpl3',
       'Rpl17', 'Rps23', 'Rps12', 'Rpl10', 'Rps13', 'Rpl17', 'Rpl7a',
       'Rps13', 'Rpl10a', 'Rps7', 'Rpl12', 'Rpl12', 'Gnas', 'Rpl10',
       'Rpl21', 'Rpl7a', 'Rpl10a', 'Rps7', 'Rpl21', 'Rpl5', 'Rpl36a',
       'Rpl34', 'Rpl5', 'Rps17', 'Rpl36a', 'Rps16', 'Rpl28', 'Hspa8',
       'Rpl23a', 'Gnas', 'Rpl34', 'Ap1s3', 'Ppib', 'AC160336.1', 'Rps6'],
      dtype=object)
np.sum(subsetdf['Difference_k_STAR'][0:75].values < 0)
64