Class RddRdfWriter<T>

Type Parameters:
T -

public class RddRdfWriter<T> extends RddRdfWriterSettings<RddRdfWriter<T>>
Important: Instances of this class should only be created using RddRdfWriterFactory because the factory is RDD-independent and can validate settings at an early stage.

This class implements a fluent API for configuring how an RDD of RDF data is saved to disk. It uniformly handles Triples, Quads, Models, Datasets, etc., using a set of lambdas for the relevant conversions. Instances of this class should be created using the appropriate createFor[Type] methods.
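
The signature `RddRdfWriter<T> extends RddRdfWriterSettings<RddRdfWriter<T>>` suggests a self-typed fluent design, where inherited setters return the concrete writer type so that calls can be chained without casts. The following is a minimal, library-independent sketch of that pattern. `SketchSettings`, `SketchWriter`, and all of their methods are hypothetical names chosen for illustration and are not part of the API documented here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical settings base class; SELF is the concrete subclass type. */
abstract class SketchSettings<SELF extends SketchSettings<SELF>> {
    protected String outputPath;
    protected boolean sort;

    /** Returns this instance typed as the concrete subclass. */
    @SuppressWarnings("unchecked")
    protected SELF self() {
        return (SELF) this;
    }

    public SELF setOutputPath(String outputPath) {
        this.outputPath = outputPath;
        return self();
    }

    public SELF setSort(boolean sort) {
        this.sort = sort;
        return self();
    }
}

/** Hypothetical writer that adds a data source on top of the inherited settings. */
class SketchWriter<T extends Comparable<T>> extends SketchSettings<SketchWriter<T>> {
    private List<T> records = new ArrayList<>();

    public SketchWriter<T> setRecords(List<T> records) {
        this.records = new ArrayList<>(records);
        return this;
    }

    /** Describes what would be written instead of touching the file system. */
    public String describe() {
        List<T> out = new ArrayList<>(records);
        if (sort) {
            Collections.sort(out);
        }
        return outputPath + " <- " + out;
    }
}
```

With this shape, a chain such as `new SketchWriter<String>().setSort(true).setOutputPath("out").setRecords(...)` compiles because each inherited setter returns `SketchWriter<String>` rather than the base type.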

  • Field Details

    • dispatcher

      protected RddRdfOpsImpl<T> dispatcher
      References the lambdas in RddRdfOpsImpl directly (saves one entry in the call stack per record)
    • sparkContext

      protected org.apache.spark.api.java.JavaSparkContext sparkContext
    • rdd

      protected org.apache.spark.api.java.JavaRDD<? extends T> rdd
    • hadoopConfiguration

      protected org.apache.hadoop.conf.Configuration hadoopConfiguration
  • Constructor Details

  • Method Details

    • setRdd

      public RddRdfWriter<T> setRdd(org.apache.spark.api.java.JavaRDD<? extends T> rdd)
    • getRdd

      public org.apache.spark.api.java.JavaRDD<? extends T> getRdd()
    • runUnchecked

      public void runUnchecked()
      Same as run() but without the checked IOException
    • run

      public void run() throws IOException
      Throws:
      IOException
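
The relationship between run() and runUnchecked() can be illustrated with a small stand-alone sketch. It assumes the unchecked variant wraps the IOException in `java.io.UncheckedIOException`; the documentation above does not say which unchecked exception type is actually used, and `SketchRunner` is a hypothetical class.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a writer with a checked and an unchecked run method. */
class SketchRunner {
    private final boolean fail;

    SketchRunner(boolean fail) {
        this.fail = fail;
    }

    /** Checked variant: callers must handle or declare IOException. */
    public void run() throws IOException {
        if (fail) {
            throw new IOException("simulated write failure");
        }
    }

    /** Unchecked variant: delegates to run() and rethrows without a checked exception. */
    public void runUnchecked() {
        try {
            run();
        } catch (IOException e) {
            // Assumption: the real class may wrap the exception in a different type.
            throw new UncheckedIOException(e);
        }
    }
}
```

The unchecked form is convenient inside lambdas and streams, where checked exceptions cannot propagate through standard functional interfaces.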
    • getEffectiveRdd

      public org.apache.spark.api.java.JavaRDD<T> getEffectiveRdd(RdfPostProcessingSettings settings)
      Create the effective RDD w.r.t. configuration (sort, unique, optimize prefixes). If optimize prefixes is enabled, then invoking this method will immediately perform that analysis. The current behavior is that this writer's prefix map will be updated to the used prefixes. However, this is subject to change such that a new writer instance with the used prefixes is created.
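
As a rough, Spark-free illustration of the three post-processing options named above, the sketch below applies them to a plain list of IRI strings. The class `EffectiveRecordsSketch`, its method names, and the order in which unique and sort are applied are assumptions made for this example; the actual method operates on a JavaRDD and may differ in detail.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration of sort, unique, and prefix optimization. */
class EffectiveRecordsSketch {

    /** Applies the optional distinct step and then the optional sort step. */
    static List<String> effectiveRecords(List<String> iris, boolean unique, boolean sort) {
        Stream<String> s = iris.stream();
        if (unique) {
            s = s.distinct();
        }
        if (sort) {
            s = s.sorted();
        }
        return s.collect(Collectors.toList());
    }

    /** Keeps only those prefixes whose namespace starts at least one of the IRIs. */
    static Map<String, String> usedPrefixes(Map<String, String> prefixes, List<String> iris) {
        Map<String, String> used = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : prefixes.entrySet()) {
            String namespace = e.getValue();
            if (iris.stream().anyMatch(iri -> iri.startsWith(namespace))) {
                used.put(e.getKey(), namespace);
            }
        }
        return used;
    }
}
```

Note that `usedPrefixes` has to scan the data, which mirrors the statement above that prefix optimization triggers an immediate analysis.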
    • runOutputToConsole

      protected void runOutputToConsole() throws IOException
      Throws:
      IOException
    • runActual

      public void runActual(RddWriterSettings<?> cxt) throws IOException
      Throws:
      IOException
    • runSpark

      public void runSpark() throws IOException
      Run the save action according to configuration
      Throws:
      IOException
    • toString

      public static String toString(org.apache.jena.shared.PrefixMapping prefixMapping, org.apache.jena.riot.RDFFormat rdfFormat)
      Convert a prefix mapping to a string
    • partitionMapperNTriples

      public static Iterator<String> partitionMapperNTriples(Iterator<org.apache.jena.graph.Triple> it)
    • partitionMapperNQuads

      public static Iterator<String> partitionMapperNQuads(Iterator<org.apache.jena.sparql.core.Quad> it)
    • createStreamRDFFactory

      public static Function<OutputStream,org.apache.jena.riot.system.StreamRDF> createStreamRDFFactory(org.apache.jena.riot.RDFFormat rdfFormat, boolean mapQuadsToTriplesForTripleLangs, org.apache.jena.shared.PrefixMapping prefixMapping)
      Create a function that can create a StreamRDF instance that is backed by the given OutputStream.
      Parameters:
      rdfFormat -
      prefixMapping -
      Returns:
    • partitionMapperRDFStream

      public static <T> org.aksw.commons.lambda.throwing.ThrowingFunction<Iterator<T>,Iterator<String>> partitionMapperRDFStream(Function<OutputStream,org.apache.jena.riot.system.StreamRDF> streamRDFFactory, BiConsumer<? super T,org.apache.jena.riot.system.StreamRDF> sendRecordToWriter)
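
The signature of partitionMapperRDFStream combines a sink factory with a per-record callback and yields a function from a partition's records to strings. The sketch below shows one way such a mapper could be put together using only the JDK. `java.io.PrintWriter` stands in for StreamRDF, and emitting a single string per partition is an assumption of this example; neither is confirmed by the documentation above.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Iterator;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Hypothetical partition mapper built from a sink factory and a record callback. */
class PartitionMapperSketch {

    /**
     * Builds a mapper that serializes all records of one partition into one string.
     * PrintWriter is used in place of StreamRDF (an assumption of this sketch).
     */
    static <T> Function<Iterator<T>, Iterator<String>> partitionMapper(
            Function<OutputStream, PrintWriter> sinkFactory,
            BiConsumer<? super T, PrintWriter> sendRecordToSink) {
        return it -> {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            PrintWriter sink = sinkFactory.apply(buffer);
            while (it.hasNext()) {
                sendRecordToSink.accept(it.next(), sink);
            }
            sink.flush(); // push buffered characters into the byte buffer
            String text = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            return Collections.singletonList(text).iterator();
        };
    }
}
```

Passing the factory instead of a ready-made sink lets each partition create its own writer on the worker that processes it.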
    • saveToFolder

      @Deprecated public static <T> void saveToFolder(org.apache.spark.api.java.JavaRDD<T> javaRdd, String path, org.apache.jena.riot.RDFFormat rdfFormat, boolean mapQuadsToTriplesForTripleLangs, org.apache.jena.shared.PrefixMapping globalPrefixMapping, BiConsumer<T,org.apache.jena.riot.system.StreamRDF> sendRecordToStreamRDF) throws IOException
      Deprecated.
      Save the data in TriG/Turtle format or one of their sub-formats (N-Quads/N-Triples). If prefixes should be written out, they have to be provided as an argument to the globalPrefixMapping parameter. Prefix mappings are broadcast to and processed in a .mapPartition operation. If the prefix mapping is non-empty, then the first part file written out contains them. No other partition will write out prefixes.
      Parameters:
      path - the folder into which the file(s) will be written
      mode - the expected behavior of saving the data to a data source
      Throws:
      IOException
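
The rule that only the first part file carries the prefixes can be sketched without Spark as follows. Partitions are modeled as plain lists of already serialized statements, and `PrefixHeaderSketch` with its `renderPartitions` method is a hypothetical helper, not part of the documented API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: prefix declarations go into the first part only. */
class PrefixHeaderSketch {

    /** Renders each partition as the text of one part file. */
    static List<String> renderPartitions(List<List<String>> partitions, Map<String, String> prefixes) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < partitions.size(); i++) {
            StringBuilder sb = new StringBuilder();
            if (i == 0) {
                // Only the first partition writes the Turtle prefix declarations.
                for (Map.Entry<String, String> e : prefixes.entrySet()) {
                    sb.append("@prefix ").append(e.getKey())
                      .append(": <").append(e.getValue()).append("> .\n");
                }
            }
            for (String statement : partitions.get(i)) {
                sb.append(statement).append('\n');
            }
            parts.add(sb.toString());
        }
        return parts;
    }
}
```

Concatenating the part files in order then gives a single document in which the prefix declarations precede every statement that uses them.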
    • saveUsingElephas

      public static <T> void saveUsingElephas(org.apache.spark.api.java.JavaRDD<T> rdd, org.apache.hadoop.fs.Path path, org.apache.jena.riot.Lang lang, org.aksw.commons.lambda.serializable.SerializableFunction<? super T,?> recordToWritable)
    • createForTriple

      public static RddRdfWriter<org.apache.jena.graph.Triple> createForTriple()
    • createForQuad

      public static RddRdfWriter<org.apache.jena.sparql.core.Quad> createForQuad()
    • createForGraph

      public static RddRdfWriter<org.apache.jena.graph.Graph> createForGraph()
    • createForDatasetGraph

      public static RddRdfWriter<org.aksw.jenax.arq.dataset.api.DatasetGraphOneNg> createForDatasetGraph()
    • createForModel

      public static RddRdfWriter<org.apache.jena.rdf.model.Model> createForModel()
    • createForDataset

      public static RddRdfWriter<org.aksw.jenax.arq.dataset.api.DatasetOneNg> createForDataset()
    • validate

      public static void validate(RddRdfWriterSettings<?> settings)
    • sendToStreamRDF

      public static <T> void sendToStreamRDF(org.apache.spark.api.java.JavaRDD<T> javaRdd, org.aksw.commons.lambda.serializable.SerializableBiConsumer<T,org.apache.jena.riot.system.StreamRDF> sendRecordToStreamRDF, org.aksw.commons.lambda.serializable.SerializableSupplier<org.apache.jena.riot.system.StreamRDF> streamRdfSupplier)