Building a Database for Single-Cell Analysis: SQL Edition

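In the previous article, I designed the tables of a relational database (RDB) for storing the results of single-cell analysis.

In this article, we actually store data in the RDB and then retrieve it with SQL queries.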

Recap of the table design

First, let's review the table design diagram created last time. It has been partly changed to make it easier to use.

[Figure: revised table design diagram]

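The change is to the relation of the projects (studies) table: one study now uses multiple samples. This also covers situations where, for example, the Tabula Muris data is used in several studies.

The hierarchy of the cell data is individual > tissue > sample > cell. Because a samples table was newly added, the relation between cells and tissues was removed. Separately, there is also the hierarchy organism > tissue > cell type > cell subtype, so the tissues table takes part in both relations.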

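Creating the tables

First, create the defined tables in a DBMS. I chose SQLite because I wanted something that would be easy to move to Google Colab later.

Note: to give away the conclusion, this was not a great choice. Large datasets such as single-cell analysis data are awkward to upload to Google Colab, so it is better to set up a DB server with MySQL or similar and fetch the data over a remote connection.

Download SQLite3 and enter the following command in Command Prompt or a terminal.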

> sqlite3 singlecell.db

This opens the SQLite prompt, where you enter the SQL queries.

> create table projects(
>   project_id text,
>   title text,
>   primary key(project_id)
> );

As in MySQL and other systems, the CREATE TABLE statement creates a new table.

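One caveat with SQLite concerns foreign keys. They can be declared in CREATE TABLE, but enforcement is off by default and has to be enabled with PRAGMA foreign_keys = ON on every connection, which is a bit of a hassle. If you do not strictly need them this is not a problem; if you do, another DBMS such as MySQL may be more comfortable.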
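For reference, here is a minimal sketch of what enabling foreign keys looks like from Python's standard sqlite3 module. It uses a throwaway in-memory database and only two of the tables. The REFERENCES clause and the ID 'ORG999999' are my additions for illustration and are not part of the schema used in this article.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
# Foreign-key enforcement is off by default and must be enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "create table organisms("
    "organism_id text not null, organism_name text not null, "
    "primary key(organism_id))"
)
# Assumption for illustration: individuals.organism_id references organisms.
conn.execute(
    "create table individuals("
    "individual_id text not null, "
    "organism_id text not null references organisms(organism_id), "
    "primary key(individual_id))"
)

conn.execute("insert into organisms values ('ORG000002', 'Mus musculus')")
conn.execute("insert into individuals values ('IDV000001', 'ORG000002')")  # parent row exists

try:
    # 'ORG999999' is a made-up ID with no parent row, so this insert should be rejected.
    conn.execute("insert into individuals values ('IDV000099', 'ORG999999')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(fk_rejected)
```

Without the PRAGMA line, the second insert would be accepted silently, which is the behavior to watch out for.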

Repeat this for the remaining tables.

> create table project_individuals(project_id text not null, individual_id text not null, primary key(project_id, individual_id));
> create table individuals(individual_id text not null, organism_id text not null, age text, primary key(individual_id));
> create table samples(individual_id text not null, sample_id text not null, tissue_id text not null, primary key(individual_id, sample_id));
> create table tissues(tissue_id text not null, organism_id text not null, tissue_name text not null, primary key(tissue_id));
> create table organisms(organism_id text not null, organism_name text not null, primary key(organism_id));
> create table cells(cell_id text not null, individual_id text, sample_id text, type_id text, subtype_id text, primary key(cell_id));
> create table cell_types(type_id text not null, tissue_id text not null, type_name text not null, primary key(type_id));
> create table cell_subtypes(type_id text not null, subtype_id text not null, subtype_name text not null, primary key(type_id, subtype_id));
> create table gene_expressions(cell_id text not null, gene_id text not null, count int, primary key(cell_id, gene_id));
> create table genes(gene_id text not null, browser_id text not null, gene_symbol text, primary key(gene_id));
> create table gene_browsers(browser_id text not null, browser_name text, primary key(browser_id));

You can check the created tables with the .tables command.

[Figure: output of the .tables command]
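The same check can be done from Python by querying SQLite's built-in sqlite_master catalog. The sketch below is only an illustration: it uses an in-memory database with two of the tables as a stand-in for singlecell.db.

```python
import sqlite3

# Stand-in for singlecell.db; only two tables are created here.
conn = sqlite3.connect(":memory:")
conn.execute("create table projects(project_id text, title text, primary key(project_id))")
conn.execute(
    "create table organisms("
    "organism_id text not null, organism_name text not null, "
    "primary key(organism_id))"
)

# sqlite_master lists every schema object; keep only the tables.
tables = [
    row[0]
    for row in conn.execute(
        "select name from sqlite_master where type = 'table' order by name"
    )
]
print(tables)
```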

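Creating and importing the CSV files

SQLite can read the data in a CSV file and store it in a table. From here on, we use Python to create the CSV files to import.

First, create dictionaries that convert the IDs and values used in the dataset to the IDs defined in the table design.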

# age & animal -> individual_id
ia_dict = {
    'young': { 3: 'IDV000001', 4: 'IDV000002', 5: 'IDV000003' },
    'old': { 1: 'IDV000004', 2: 'IDV000005', 3: 'IDV000006' },
}

# replicate -> sample_id 
sr_dict = { 1: 'SMP001', 2: 'SMP002' }

# cell_type -> type_id (cell ontology)
tc_dict = {
    'leukocyte': 'CL:0000738',
    'lung endothelial cell': 'CL:1001567',
    'alveolar macrophage': 'CL:0000583',
    'myeloid cell': 'CL:0000763',
    'classical monocyte': 'CL:0000860',
    'stromal cell': 'CL:0000499',
    'B cell': 'CL:0000236',
    'T cell': 'CL:0000084',
    'non-classical monocyte': 'CL:0000875',
    'natural killer cell': 'CL:0000623',
    'mast cell': 'CL:0000097',
    'type II pneumocyte': 'CL:0002063',
    'ciliated columnar cell of tracheobronchial tree': 'CL:0002145',
}

# subtype -> type_id & subtype_id
in_dict = {
    'Npnt stromal cell': [tc_dict['stromal cell'], 'CLS001'],
    'Hhip stromal cell': [tc_dict['stromal cell'], 'CLS002'],
    'Gucy1a3 stromal cell': [tc_dict['stromal cell'], 'CLS003'],
    'Dcn stromal cell': [tc_dict['stromal cell'], 'CLS004'],
    'CD4 T cell': [tc_dict['T cell'], 'CLS001'],
    'CD8 T cell': [tc_dict['T cell'], 'CLS002'],
}

Using these dictionaries, we convert the contents of the single-cell analysis data to CSV files. As usual, the data is loaded as an AnnData object.

import os
import numpy as np
import pandas as pd
import scanpy as sc

os.system('wget -nv https://storage.googleapis.com/calico-website-mca-storage/lung.h5ad')
adata = sc.read_h5ad('lung.h5ad')
obs = adata.obs

Next, define a function that writes the CSV files.

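def outCSVFile(filename, **kwargs):
    # One keyword argument per column; written without header or index,
    # so keyword order must match the column order of the target table
    pd.DataFrame(kwargs).to_csv(filename + '.csv', header=False, index=False)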
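To make the resulting file layout concrete, here is a small standard-library sketch of the same idea. The function columns_to_csv_text is a hypothetical stand-in for outCSVFile, not the function used in this article: it returns the CSV text instead of writing a file, so the headerless, column-ordered output is easy to see.

```python
import csv
import io


def columns_to_csv_text(**columns):
    """Hypothetical stand-in for outCSVFile: one keyword per column, no header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # zip(*values) turns equal-length columns into rows, in keyword order.
    writer.writerows(zip(*columns.values()))
    return buffer.getvalue()


text = columns_to_csv_text(
    organism_id=["ORG000001", "ORG000002"],
    organism_name=["Homo sapiens", "Mus musculus"],
)
print(text)
```

Because there is no header row, the import step matches columns purely by position, so the keyword order here has to follow the column order in the corresponding CREATE TABLE statement.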

With the preparation done, extract and convert the required data from adata and create the CSV files.

# projects
project_id = np.array(['PRJ000001'])
title = np.array(['Murine single-cell RNA-seq reveals cell-identity- and tissue-specific trajectories of aging'])
outCSVFile('projects', project_id=project_id, title=title)

# organisms
organism_id = np.array(['ORG000001', 'ORG000002'])
organism_name = np.array(['Homo sapiens', 'Mus musculus'])
outCSVFile('organisms', organism_id=organism_id, organism_name=organism_name)

# individuals
individual_id = np.array(['IDV000001', 'IDV000002', 'IDV000003', 'IDV000004', 'IDV000005', 'IDV000006'])
organism_id = np.array(['ORG000002'] * 6)
age = np.repeat(np.array(['young', 'old']), 3)
outCSVFile('individuals', individual_id=individual_id, organism_id=organism_id, age=age)

# project_individuals
project_id2 = np.array(['PRJ000001'] * 6)
individual_id2 = individual_id
outCSVFile('project_individuals', project_id=project_id2, individual_id=individual_id2)

# tissues
tissue_id = np.array(['TSS000001'])
organism_id = np.array(['ORG000002'])
tissue_name = np.array(['Lung'])
outCSVFile('tissues', tissue_id=tissue_id, organism_id=organism_id, tissue_name=tissue_name)

# samples
# two samples per individual, except the last individual, which has one (11 rows in total)
individual_id2 = np.repeat(individual_id, 2)[:-1]
sample_id = np.tile(np.array(['SMP001', 'SMP002']), 6)[:-1]
tissue_id2 = np.array(['TSS000001'] * 11)
outCSVFile('samples', individual_id=individual_id2, sample_id=sample_id, tissue_id=tissue_id2)

# gene_browsers
browser_id = np.array(['BRW000001'])
browser_name = np.array(['Ensembl'])
outCSVFile('gene_browsers', browser_id=browser_id, browser_name=browser_name)

# genes
gene_id = np.array(adata.var['gene_ids-0'].tolist())
browser_id2 = np.array(['BRW000001'] * len(gene_id))
gene_symbol = np.array(adata.var.index.tolist())
outCSVFile('genes', gene_id=gene_id, browser_id=browser_id2, gene_symbol=gene_symbol)

# cell_types
type_name = np.array(adata.obs['cell_type'].unique())
tissue_id2 = np.array(['TSS000001'] * len(type_name))
type_id = np.array([tc_dict[x] for x in type_name])
outCSVFile('cell_types', type_id=type_id, tissue_id=tissue_id2, type_name=type_name)

# cell_subtypes
d = obs[['cell_type', 'subtype']].drop_duplicates()
# keep only the pairs whose subtype label differs from the cell type label
d = d[d['cell_type'].astype(str) != d['subtype'].astype(str)]
subtype_name = np.array(d['subtype'].tolist())
type_id2 = np.array([in_dict[x][0] for x in subtype_name])
subtype_id = np.array([in_dict[x][1] for x in subtype_name])
outCSVFile('cell_subtypes', type_id=type_id2, subtype_id=subtype_id, subtype_name=subtype_name)

# cells
cell_id = np.array(['CLL{:0=12}'.format(i+1) for i in range(len(obs))])
individual_id2 = np.array([ia_dict[age][animal] for age, animal in zip(obs['age'], obs['animal'])])
sample_id2 = np.array([sr_dict[rep] for rep in obs['replicate']])
type_id2 = np.array([tc_dict[cell_type] for cell_type in obs['cell_type']])
subtype_id2 = np.array([in_dict[x][1] if x in in_dict else '' for x in obs['subtype']])
outCSVFile('cells', cell_id=cell_id, individual_id=individual_id2, sample_id=sample_id2, type_id=type_id2, subtype_id=subtype_id2)

# gene_expressions
df_X = pd.DataFrame(adata.X.todense())
df_X.columns = gene_id
df_X['cell_id'] = cell_id
df_X.melt(id_vars='cell_id', var_name='gene_id', value_name='count').to_csv('gene_expressions.csv', header=False, index=False)
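The melt call above turns the cell-by-gene matrix into one row per (cell, gene) pair. The toy sketch below shows that long format in plain Python. The gene IDs and counts are made-up placeholders. The zero-filtering variant at the end is my own suggestion rather than what this article does, and it would change the result of any query that counts rows.

```python
# Toy 2 cells x 2 genes matrix; the gene IDs and counts are placeholders.
cell_ids = ["CLL{:012d}".format(i + 1) for i in range(2)]
gene_ids = ["GENE_A", "GENE_B"]
matrix = [
    [0, 3],  # counts for the first cell
    [5, 0],  # counts for the second cell
]

# Same row order as the melt call: all cells for the first gene, then the next gene.
long_rows = [
    (cell_ids[i], gene_ids[j], matrix[i][j])
    for j in range(len(gene_ids))
    for i in range(len(cell_ids))
]

# Optional variation (an assumption, not used in this article): skip zero counts.
nonzero_rows = [row for row in long_rows if row[2] != 0]

print(len(long_rows), len(nonzero_rows))
```

The long table always has (number of cells) x (number of genes) rows, which is why gene_expressions.csv becomes by far the largest file.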

All the CSV files have now been created.

Import these files with the .import command.

.mode csv
.import cell_subtypes.csv cell_subtypes
.import cell_types.csv cell_types
.import cells.csv cells
.import gene_browsers.csv gene_browsers
.import gene_expressions.csv gene_expressions
.import genes.csv genes
.import individuals.csv individuals
.import organisms.csv organisms
.import project_individuals.csv project_individuals
.import projects.csv projects
.import samples.csv samples
.import tissues.csv tissues
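If you would rather load the files from Python than from the SQLite shell, something like the following should work with the standard csv and sqlite3 modules. This is only a sketch: it uses an in-memory database and an inline CSV string in place of singlecell.db and organisms.csv.

```python
import csv
import io
import sqlite3

# Inline stand-in for organisms.csv (headerless, same column order as the table).
csv_text = "ORG000001,Homo sapiens\nORG000002,Mus musculus\n"

conn = sqlite3.connect(":memory:")  # stand-in for singlecell.db
conn.execute(
    "create table organisms("
    "organism_id text not null, organism_name text not null, "
    "primary key(organism_id))"
)

rows = list(csv.reader(io.StringIO(csv_text)))
# "with conn" commits on success and rolls back if an insert fails.
with conn:
    conn.executemany("insert into organisms values (?, ?)", rows)

n_rows = conn.execute("select count(*) from organisms").fetchone()[0]
print(n_rows)
```

For a real file, replace io.StringIO(csv_text) with a file opened with open('organisms.csv', newline='').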

Retrieving data

All the data is now stored in the tables. Finally, check that the required data can be retrieved with a SELECT statement.

select
  ct.type_name
  , g.gene_symbol
  , i.age
  , count(*) as n_data
from
  gene_expressions ge
  left outer join genes g
    on g.gene_id = ge.gene_id
  left outer join cells c
    on c.cell_id = ge.cell_id
  left outer join cell_types ct
    on ct.type_id = c.type_id
  left outer join samples s
    on s.individual_id = c.individual_id
    and s.sample_id = c.sample_id
  left outer join individuals i
    on i.individual_id = s.individual_id
where
  ge.gene_id = 'ENSMUSG00000025902'
  and ct.type_name = 'T cell'
group by
  ct.type_name
  , g.gene_symbol
  , i.age;

[Figure: result of the SELECT query]

This confirms that the data can be retrieved from the database rather than from files. All that remains is to add whatever code is needed for the programming language you use.
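For Python, that code can be as small as the sketch below, which uses the standard sqlite3 module. The tables are a simplified in-memory stand-in with made-up rows (cells carries only cell_id and type_id here), and count_expression_rows is a hypothetical helper name, so treat it as an outline rather than the exact query above.

```python
import sqlite3

# Tiny in-memory stand-in for singlecell.db; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table cell_types(type_id text, tissue_id text, type_name text);
create table cells(cell_id text, type_id text);
create table gene_expressions(cell_id text, gene_id text, count int);
insert into cell_types values ('CL:0000084', 'TSS000001', 'T cell');
insert into cell_types values ('CL:0000236', 'TSS000001', 'B cell');
insert into cells values ('CLL000000000001', 'CL:0000084');
insert into cells values ('CLL000000000002', 'CL:0000084');
insert into cells values ('CLL000000000003', 'CL:0000236');
insert into gene_expressions values ('CLL000000000001', 'ENSMUSG00000025902', 2);
insert into gene_expressions values ('CLL000000000002', 'ENSMUSG00000025902', 0);
insert into gene_expressions values ('CLL000000000003', 'ENSMUSG00000025902', 7);
""")


def count_expression_rows(conn, gene_id, type_name):
    """Hypothetical helper: number of expression rows for one gene in one cell type."""
    sql = """
        select count(*)
        from gene_expressions ge
          join cells c on c.cell_id = ge.cell_id
          join cell_types ct on ct.type_id = c.type_id
        where ge.gene_id = ? and ct.type_name = ?
    """
    # The "?" placeholders let sqlite3 bind the values instead of pasting them into the SQL text.
    return conn.execute(sql, (gene_id, type_name)).fetchone()[0]


n_t_cell = count_expression_rows(conn, "ENSMUSG00000025902", "T cell")
print(n_t_cell)
```

Against the real database you would open it with sqlite3.connect('singlecell.db') and skip the setup script.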

Summary

In this article, we created a database with SQLite based on the table design from the previous article.

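Storing the data in an RDB lets you handle large datasets without loading everything into memory, and it also makes it possible for several people to work with the same data, so I encourage you to try it if you can.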
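As a rough illustration of the memory point, the sketch below reads a table in small batches with cursor.fetchmany instead of loading every row at once. The table contents and the batch size of 4 are arbitrary values chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for singlecell.db
conn.execute("create table gene_expressions(cell_id text, gene_id text, count int)")
with conn:
    conn.executemany(
        "insert into gene_expressions values (?, ?, ?)",
        [("CLL{:012d}".format(i + 1), "GENE_A", i % 3) for i in range(10)],
    )

cursor = conn.execute('select "count" from gene_expressions')
total = 0
n_batches = 0
while True:
    batch = cursor.fetchmany(4)  # at most 4 rows are held in memory at a time
    if not batch:
        break
    total += sum(row[0] for row in batch)
    n_batches += 1

print(total, n_batches)
```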

This article dealt with an SQL database system, but there are also NoSQL systems and distributed systems such as Hadoop, which may be interesting to look into.

That's all.