public class CreateASMCoordsUpdateFilesPlugin
This class traverses the db and writes, to csv files, the haplotypes_id, asm_start_coordinate, asm_end_coordinate, asm_strand values. These files will be loaded to the db into a temp table, from which updates will occur for haplotype nodes. This class is a successor to UpdateDBAsmCoordinatesPlugin.kt. There were issues with batching in the original class - only single updates, one at a time, worked.
For this plugin: strand shouldn't change, but it will be uploaded as well. In kotlin/../api/ConvertVariantContextToVariantInfo:determineASMInfo(), the code determines asm coordinates based on the chromosomes being the same and the strand being the same.
This example comes from https://tapoueh.org/blog/2018/07/batch-updates-and-concurrency/
The commands you need to run in the db:
1. Create the temp table via:
   postgres# create temp table batch (haplotypes_id integer, asm_start_coordinate integer, asm_end_coordinate integer, asm_strand varchar(1));
2. Run this plugin to get the csv files for import.
3. You can either load the files separately, or cat them and then load a single file. In the folder where the csv files live:
   > ls -1 *.csv > csvFileList.txt
   > xargs cat < csvFileList.txt > all25files.csv
   Then sort the file to get all the header lines in one place, move a header line back to the top of the file, and delete the other header lines. You can do that manually, or use sort, e.g.
   > sort all25files.csv -t, -k1,1 > all25filesSORTED.csv
   3a. (NOT RUN) In the db again, run the following. This is for dumping data to a csv and then adding it to a different table. NOTE: we aren't doing this - we are getting data from this plugin and loading that.
   phgsmallseq=# \COPY haplotypes(haplotypes_id, asm_start_coordinate, asm_end_coordinate, asm_strand) TO '/Users/lcj34/notes_files/phg_2018/debug/zackFix_mafToGVCF_Oct2022/haplotypesPostgres_just3Cols.csv' with delimiter ',' CSV HEADER;
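The concatenate-then-fix-headers step above can also be done in a single pass with awk, keeping the first file's header and skipping the header line of every subsequent file. This is a sketch using the example file name from above; writing the merged file to the parent directory keeps it out of the *.csv glob if the command is re-run:

```shell
# FNR==1 matches the first line of each input file; NR!=1 means we are past
# the very first line overall, so every header except the first is dropped.
awk 'FNR==1 && NR!=1 {next} {print}' *.csv > ../all25files.csv
```

This avoids the separate sort/move-header cleanup, since only one header line ever reaches the output.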
4. Import data from the csv files from step 2 above into the temp table "batch" created in step 1:
   postgres=# \copy batch from '/workdir/lcj34/zackFixMAFToGvcf_oct2022/asmCSVs/all25files.csv' with csv header delimiter ',';
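Before running the \copy above, a quick sanity check that every data row in the merged file has exactly the four expected columns (haplotypes_id, asm_start_coordinate, asm_end_coordinate, asm_strand) can save a failed import. A sketch, using the example file name from above:

```shell
# Print any data row that does not have exactly 4 comma-separated fields;
# no output means every row matches the 4-column batch table.
awk -F, 'NR>1 && NF!=4' all25files.csv
```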
5. Now you have data in the temp table - transfer it to the haplotypes table (only 6 rows updated in this example as it was smallSeq):
   phgsmallseq=# update haplotypes
       set (asm_start_coordinate, asm_end_coordinate) = (batch.asm_start_coordinate, batch.asm_end_coordinate)
       from batch
       where batch.haplotypes_id = haplotypes.haplotypes_id
         and (haplotypes.asm_start_coordinate, haplotypes.asm_end_coordinate) <> (batch.asm_start_coordinate, batch.asm_end_coordinate);
   UPDATE 6
This will work for postgres - not sure if it will work for sqlite. For the record: I used the methods outlined above to create 13167015 records for potential updates, which resulted in 12828476 updates. It took 10.5 minutes to do this update in maize_2_1 on 11/14/22.
NOTE: the plugin took 5 hours 27 minutes to run on dc01 on Nov 10, 2022 with 86 gvcf files and lots of haplotypes. You want to run on a cbsu machine, not your laptop.
public CreateASMCoordsUpdateFilesPlugin(@Nullable java.awt.Frame parentFrame, boolean isInteractive)
public CreateASMCoordsUpdateFilesPlugin()
@Nullable public net.maizegenetics.plugindef.DataSet processData(@Nullable net.maizegenetics.plugindef.DataSet input)
@NotNull public java.util.List<net.maizegenetics.pangenome.processAssemblyGenomes.CreateASMCoordsUpdateFilesPlugin.UpdateData> findAsmUpdatesForGVCF(@NotNull java.lang.String gvcfFile, @NotNull java.util.Map<java.lang.Integer,net.maizegenetics.pangenome.processAssemblyGenomes.CreateASMCoordsUpdateFilesPlugin.RefRangeData> refIdToRefRangeDataMap)
@Nullable public javax.swing.ImageIcon getIcon()
@NotNull public java.lang.String getButtonName()
@NotNull public java.lang.String getToolTipText()
@NotNull public java.lang.String gvcfDirectory()
Local directory holding all the gvcf files that were processed into haplotypes for the database.
@NotNull public CreateASMCoordsUpdateFilesPlugin gvcfDirectory(@NotNull java.lang.String value)
Set Gvcf Directory. Local directory holding all the gvcf files that were processed into haplotypes for the database.
value - Gvcf Directory
@NotNull public java.lang.String asmCSVdir()
Local directory to which the haplotype update CSV files will be written.
@NotNull public CreateASMCoordsUpdateFilesPlugin asmCSVdir(@NotNull java.lang.String value)
Set asmCSVdir. Local directory to which the haplotype update CSV files will be written.
value - asmCSVdir
@NotNull public java.lang.String configFile()
Config file with parameters for database connection.
@NotNull public CreateASMCoordsUpdateFilesPlugin configFile(@NotNull java.lang.String value)
Set Config File. Config file with parameters for database connection.
value - Config File
public int queueSize()
Size of Queue used to pass information to the DB writing thread. Increase this number to have better thread utilization at the expense of RAM. If you are running into Java heap Space/RAM issues and cannot use a bigger machine, decrease this parameter.
@NotNull public CreateASMCoordsUpdateFilesPlugin queueSize(int value)
Set Queue Size. Size of Queue used to pass information to the DB writing thread. Increase this number to have better thread utilization at the expense of RAM. If you are running into Java heap Space/RAM issues and cannot use a bigger machine, decrease this parameter.
value - Queue Size
public int numThreads()
Number of threads used to upload. The GVCF upload will subtract 2 from this number to have the number of worker threads. It leaves 1 thread for IO to the DB and 1 thread for the Operating System.
@NotNull public CreateASMCoordsUpdateFilesPlugin numThreads(int value)
Set Num Threads. Number of threads used to upload. The GVCF upload will subtract 2 from this number to have the number of worker threads. It leaves 1 thread for IO to the DB and 1 thread for the Operating System.
value
- Num Threads