## Study
Submission Type	Short genetic variations
Submitter Last Name	Kawai
Submitter Middle Name
Submitter First Name	Yosuke
Submitter Affiliation	National Center for Global Health and Medicine
Study Title	Construction of control data for the promotion of genomic medicine for cancers and rare diseases
Study Description	Whole genome sequencing (WGS) is being promoted to improve medical care for rare diseases and cancers. Genome analyses of these diseases require WGS data from a healthy control group. We conducted WGS analysis of healthy control individuals for cancers and rare diseases, using biobank specimens held by the six National Center (NC) Biobanks in Japan and taking regional variation into consideration, and constructed a genome database of healthy individuals and disease control groups.
Study Type	Control Set
PubMed ID	38060463
Publication DOI	10.1371/journal.pgen.1010625
BioProject Accession	PRJDB18486
Study URL	https://humandbs.dbcls.jp/en/hum0331-v1
Related Study
Study ID	Kawai2024b
Submission Date	2024-01-23
Public Release Date	2024-07-22
Last Update Date

## SampleSet
# SampleSet ID	SampleSet Size	SampleSet Type	SampleSet Name	SampleSet Description	SampleSet Phenotype	SampleSet Population	SampleSet Sex
1	20994	Control	NCGM	NCGM

## Sample
# BioSample Accession	Sample Name	Subject ID	SampleSet ID	Sample Resource	Sample Cell Type	Sample Attribute	Sample Karyotype	Subject Collection	Subject Phenotype	Subject Population	Subject Karyotype	Subject Sex	Subject Age	Subject Age Units	Subject Maternal ID	Subject Paternal ID
SAMD00802407	JPN	not applicable	1			sample size: 9290, description: NCBN samples that fall into the Japanese cluster in the PCA plot. They include all samples from JPN_Hondo and JPN_Ryukyu, as well as samples located between those clusters.				Japanese
SAMD00802408	JPN_Hondo	not applicable	1			sample size: 9036, description: Samples belonging to the largest cluster within JPN.				Japanese
SAMD00802409	JPN_Ryukyu	not applicable	1			sample size: 186, description: Samples belonging to the smaller clusters within JPN. Presumed to be individuals with ancestry in the Ryukyu Islands.				Japanese
SAMN07488236	ACB	not applicable	1			sample size: 94, description: African Caribbean in Barbados (1000 Genomes)		1000 Genomes		African-Caribbean
SAMN07488235	ASW	not applicable	1			sample size: 56, description: African Ancestry in Southwest US (1000 Genomes)		1000 Genomes		African-American SW
SAMN07488215	BEB	not applicable	1			sample size: 85, description: Bengali in Bangladesh (1000 Genomes)		1000 Genomes		Bengali
SAMN07488225	CDX	not applicable	1			sample size: 93, description: Chinese Dai in Xishuangbanna, China (1000 Genomes)		1000 Genomes		Dai Chinese
SAMN07488216	CEU	not applicable	1			sample size: 98, description: Utah residents (CEPH) with Northern and Western European ancestry (1000 Genomes)		1000 Genomes		CEPH
SAMN07488224	CHB	not applicable	1			sample size: 103, description: Han Chinese in Beijing, China (1000 Genomes)		1000 Genomes		Han Chinese
SAMN07488221	CHS	not applicable	1			sample size: 104, description: Han Chinese South (1000 Genomes)		1000 Genomes		Southern Han Chinese
SAMN07488229	CLM	not applicable	1			sample size: 92, description: Colombian in Medellin, Colombia (1000 Genomes)		1000 Genomes		Colombian
SAMN07488234	ESN	not applicable	1			sample size: 98, description: Esan in Nigeria (1000 Genomes)		1000 Genomes		Esan
SAMN07488219	FIN	not applicable	1			sample size: 99, description: Finnish in Finland (1000 Genomes)		1000 Genomes		Finnish
SAMN07488220	GBR	not applicable	1			sample size: 91, description: British in England and Scotland (1000 Genomes)		1000 Genomes		British
SAMN07488214	GIH	not applicable	1			sample size: 102, description: Gujarati Indian in Houston, TX (1000 Genomes)		1000 Genomes		Gujarati
SAMN07488233	GWD	not applicable	1			sample size: 113, description: Gambian in Western Division, The Gambia (1000 Genomes)		1000 Genomes		Gambian
SAMN07488218	IBS	not applicable	1			sample size: 107, description: Iberian populations in Spain (1000 Genomes)		1000 Genomes		Spanish
SAMN07488213	ITU	not applicable	1			sample size: 102, description: Indian Telugu in the UK (1000 Genomes)		1000 Genomes		Indian
SAMN07488223	JPT	not applicable	1			sample size: 104, description: Japanese in Tokyo, Japan (1000 Genomes)		1000 Genomes		Japanese
SAMN07488222	KHV	not applicable	1			sample size: 99, description: Kinh in Ho Chi Minh City, Vietnam (1000 Genomes)		1000 Genomes		Kinh Vietnamese
SAMN07488232	LWK	not applicable	1			sample size: 96, description: Luhya in Webuye, Kenya (1000 Genomes)		1000 Genomes		Luhya
SAMN07488231	MSL	not applicable	1			sample size: 85, description: Mende in Sierra Leone (1000 Genomes)		1000 Genomes		Mende
SAMN07488228	MXL	not applicable	1			sample size: 64, description: Mexican Ancestry in Los Angeles, California (1000 Genomes)		1000 Genomes		Mexican-American
SAMN07488227	PEL	not applicable	1			sample size: 84, description: Peruvian in Lima, Peru (1000 Genomes)		1000 Genomes		Peruvian
SAMN07488212	PJL	not applicable	1			sample size: 96, description: Punjabi in Lahore, Pakistan (1000 Genomes)		1000 Genomes		Punjabi
SAMN07488226	PUR	not applicable	1			sample size: 104, description: Puerto Rican in Puerto Rico (1000 Genomes)		1000 Genomes		Puerto Rican
SAMN07488211	STU	not applicable	1			sample size: 99, description: Sri Lankan Tamil in the UK (1000 Genomes)		1000 Genomes		Sri Lankan
SAMN07488217	TSI	not applicable	1			sample size: 107, description: Toscani in Italia (1000 Genomes)		1000 Genomes		Tuscan
SAMN07488230	YRI	not applicable	1			sample size: 107, description: Yoruba in Ibadan, Nigeria (1000 Genomes)		1000 Genomes		Yoruba

## Experiment
# Experiment ID	Experiment Type	Method Type	Analysis Type	Reference Type	Reference Value	Merged Experiment IDs	Experiment Resolution	Method Platform	Method Description	Analysis Description	Detection Method	Detection Description	External Links
1	Discovery	Sequencing	Paired-end mapping	Assembly	GRCh38			Illumina NovaSeq 6000	DNA samples that meet the criteria for the study will be shipped from the biobank and subjected to WGS analysis at a contract analysis laboratory. WGS analysis will be performed on a NovaSeq 6000 sequencer using a PCR-free protocol to obtain a minimum output of 90 Gb.
- Confirm library size is 400-750 bp.
- At least 75% of the bases are QV30 or better.
- Total number of bases after removal of duplicate reads by FastQC is more than 90 Gb.	The read data in FASTQ format obtained from the analysis will be subjected to data analysis (mapping and variant calling) at the principal institute (National Center for Global Health and Medicine), and the data including variant information will be compiled into a database.
Samples meeting any of the following criteria were excluded:
- Samples with abnormal values for depth and mapping rate.
- Samples where the depth of the sex chromosome is inconsistent with the clinical information.
- Any of the samples determined to be within the second degree of kinship in the KING program.

Variant call results were filtered as follows:
- Genotypes with GQ64 or with less than 25% minor alleles in heterozygous calls are set to no call
- VQSR results are set in the FILTER field of the VCF
- LowCR is set in the FILTER field for variants with a call rate below 95%
- HWE is set in the FILTER field for variants with a Hardy-Weinberg equilibrium test P-value less than 10^-6	HaplotypeCaller (GATK4.1.0) compatible algorithm (Parabricks 3.1.0 haplotypecaller)

## Dataset
# Dataset ID	SampleSet ID	Experiment ID	Number of Chromosomes Sampled	Dataset Description	Linkout URL	submitted/VCF Filename
1	1	1	23544	chromosome 1		submitted/chr1.vcf
2	1	1	23544	chromosome 2		submitted/chr2.vcf
3	1	1	23544	chromosome 3		submitted/chr3.vcf
4	1	1	23544	chromosome 4		submitted/chr4.vcf
5	1	1	23544	chromosome 5		submitted/chr5.vcf
6	1	1	23544	chromosome 6		submitted/chr6.vcf
7	1	1	23544	chromosome 7		submitted/chr7.vcf
8	1	1	23544	chromosome 8		submitted/chr8.vcf
9	1	1	23544	chromosome 9		submitted/chr9.vcf
10	1	1	23544	chromosome 10		submitted/chr10.vcf
11	1	1	23544	chromosome 11		submitted/chr11.vcf
12	1	1	23544	chromosome 12		submitted/chr12.vcf
13	1	1	23544	chromosome 13		submitted/chr13.vcf
14	1	1	23544	chromosome 14		submitted/chr14.vcf
15	1	1	23544	chromosome 15		submitted/chr15.vcf
16	1	1	23544	chromosome 16		submitted/chr16.vcf
17	1	1	23544	chromosome 17		submitted/chr17.vcf
18	1	1	23544	chromosome 18		submitted/chr18.vcf
19	1	1	23544	chromosome 19		submitted/chr19.vcf
20	1	1	23544	chromosome 20		submitted/chr20.vcf
21	1	1	23544	chromosome 21		submitted/chr21.vcf
22	1	1	23544	chromosome 22		submitted/chr22.vcf
23	1	1	18782	chromosome X		submitted/chrX.vcf