By Mark Russinovich
Published: December 11, 2018
Download SDelete (221 KB)
Introduction
One feature of Windows NT/2000's (Win2K) C2-compliance is that it implements object reuse protection. This means that when an application allocates file space or virtual memory it is unable to view data that was previously stored in the resources Windows NT/2K allocates for it. Windows NT zero-fills memory and zeroes the sectors on disk where a file is placed before it presents either type of resource to an application. However, object reuse does not dictate that the space that a file occupies before it is deleted be zeroed. This is because Windows NT/2K is designed with the assumption that the operating system controls access to system resources. However, when the operating system is not active it is possible to use raw disk editors and recovery tools to view and recover data that the operating system has deallocated. Even when you encrypt files with Win2K's Encrypting File System (EFS), a file's original unencrypted file data is left on the disk after a new encrypted version of the file is created.
The only way to ensure that deleted files, as well as files that you encrypt with EFS, are safe from recovery is to use a secure delete application. Secure delete applications overwrite a deleted file's on-disk data using techniques that are shown to make disk data unrecoverable, even using recovery technology that can read patterns in magnetic media that reveal weakly deleted files. SDelete (Secure Delete) is such an application. You can use SDelete both to securely delete existing files, as well as to securely erase any file data that exists in the unallocated portions of a disk (including files that you have already deleted or encrypted). SDelete implements the Department of Defense clearing and sanitizing standard DOD 5220.22-M, to give you confidence that once deleted with SDelete, your file data is gone forever. Note that SDelete securely deletes file data, but not file names located in free disk space.
Using SDelete
SDelete is a command line utility that takes a number of options. In any given use, it allows you to delete one or more files and/or directories, or to cleanse the free space on a logical disk. SDelete accepts wild card characters as part of the directory or file specifier.
Usage: sdelete [-p passes] [-s] [-q] <file or directory>...
sdelete [-p passes] [-z|-c] [drive letter] ...
Parameter | Description |
---|---|
-a | Remove Read-Only attribute. |
-c | Clean free space. |
-p passes | Specifies number of overwrite passes (default is 1). |
-q | Don't print errors (Quiet). |
-s or -r | Recurse subdirectories. |
-z | Zero free space (good for virtual disk optimization). |
How SDelete Works
Securely deleting a file that has no special attributes is relatively straightforward: the secure delete program simply overwrites the file with the secure delete pattern. What is more tricky is securely deleting Windows NT/2K compressed, encrypted and sparse files, and securely cleansing disk free space.
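To make the simple case concrete, here is a minimal Python sketch of the overwrite-then-delete idea described above. It is an illustration only, not SDelete's implementation: the pass count, pattern, and file path are assumptions, and it ignores the compressed, encrypted and sparse cases discussed next.

```python
import os

def overwrite_and_delete(path, passes=1, chunk=1024 * 1024):
    """Illustrative only: overwrite a plain file in place, then delete it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                n = min(chunk, remaining)
                f.write(b"\x00" * n)   # a single fixed pattern; real tools vary the patterns
                remaining -= n
            f.flush()
            os.fsync(f.fileno())       # push the data past the file system cache
    os.remove(path)

# Example (hypothetical file):
# overwrite_and_delete(r"C:\temp\secret.txt", passes=3)
```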
Compressed, encrypted and sparse files are managed by NTFS in 16-cluster blocks. If a program writes to an existing portion of such a file NTFS allocates new space on the disk to store the new data and after the new data has been written, deallocates the clusters previously occupied by the file. NTFS takes this conservative approach for reasons related to data integrity, and in the case of compressed and sparse files, in case a new allocation is larger than what exists (the new compressed data is bigger than the old compressed data). Thus, overwriting such a file will not succeed in deleting the file's contents from the disk.
To handle these types of files SDelete relies on the defragmentation API. Using the defragmentation API, SDelete can determine precisely which clusters on a disk are occupied by data belonging to compressed, sparse and encrypted files. Once SDelete knows which clusters contain the file's data, it can open the disk for raw access and overwrite those clusters.
Cleaning free space presents another challenge. Since FAT and NTFS provide no means for an application to directly address free space, SDelete has one of two options. The first is that it can, like it does for compressed, sparse and encrypted files, open the disk for raw access and overwrite the free space. This approach suffers from a big problem: even if SDelete were coded to be fully capable of calculating the free space portions of NTFS and FAT drives (something that's not trivial), it would run the risk of collision with active file operations taking place on the system. For example, say SDelete determines that a cluster is free, and just at that moment the file system driver (FAT, NTFS) decides to allocate the cluster for a file that another application is modifying. The file system driver writes the new data to the cluster, and then SDelete comes along and overwrites the freshly written data: the file's new data is gone. The problem is even worse if the cluster is allocated for file system metadata since SDelete will corrupt the file system's on-disk structures.
The second approach, and the one SDelete takes, is to indirectly overwrite free space. First, SDelete allocates the largest file it can. SDelete does this using non-cached file I/O so that the contents of the NT file system cache will not be thrown out and replaced with useless data associated with SDelete's space-hogging file. Because non-cached file I/O must be sector (512-byte) aligned, there might be some left over space that isn't allocated for the SDelete file even when SDelete cannot further grow the file. To grab any remaining space SDelete next allocates the largest cached file it can. For both of these files SDelete performs a secure overwrite, ensuring that all the disk space that was previously free becomes securely cleansed.
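The grow-a-file-until-the-disk-is-full step can be sketched in a few lines of Python. This is only a simplified illustration under assumed names and paths; it does not use the non-cached, sector-aligned I/O or the second cached file that SDelete actually employs.

```python
import os

def fill_free_space(tmp_path, chunk=4 * 1024 * 1024):
    """Illustrative only: append a pattern until the volume is full, then delete
    the temporary file, leaving the previously free clusters overwritten."""
    with open(tmp_path, "wb", buffering=0) as f:
        while True:
            try:
                f.write(b"\x00" * chunk)
            except OSError:          # "disk full" surfaces here with unbuffered I/O
                break
        os.fsync(f.fileno())
    os.remove(tmp_path)

# Example (hypothetical path on the volume being cleansed):
# fill_free_space(r"C:\sdelete_fill.tmp")
```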
On NTFS drives SDelete's job isn't necessarily through after it allocates and overwrites the two files. SDelete must also fill any existing free portions of the NTFS MFT (Master File Table) with files that fit within an MFT record. An MFT record is typically 1KB in size, and every file or directory on a disk requires at least one MFT record. Small files are stored entirely within their MFT record, while files that don't fit within a record are allocated clusters outside the MFT. All SDelete has to do to take care of the free MFT space is allocate the largest file it can - when the file occupies all the available space in an MFT record NTFS will prevent the file from getting larger, since there are no free clusters left on the disk (they are being held by the two files SDelete previously allocated). SDelete then repeats the process. When SDelete can no longer even create a new file, it knows that all the previously free records in the MFT have been completely filled with securely overwritten files.
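A rough Python illustration of that loop follows: keep creating small files until creation itself fails, then remove them. The directory, file names, sizes and stopping condition are assumptions for illustration; SDelete does this with small resident files while the disk's clusters are already held by its two large files.

```python
import os

def fill_mft_records(directory, max_files=1_000_000):
    """Illustrative only: create tiny files until the file system refuses to
    create more, which (on NTFS) fills otherwise-free MFT records."""
    created = []
    for i in range(max_files):
        path = os.path.join(directory, "sdelete_mft_%06d.tmp" % i)
        try:
            with open(path, "wb") as f:
                f.write(b"\x00" * 800)   # small enough to stay resident in the MFT record
        except OSError:                  # no more files can be created
            break
        created.append(path)
    for path in created:
        os.remove(path)

# Example (hypothetical directory on the volume being cleansed):
# fill_mft_records(r"C:\sdelete_mft_fill")
```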
To overwrite file names of a file that you delete, SDelete renames the file 26 times, each time replacing each character of the file's name with a successive alphabetic character. For instance, the first rename of 'foo.txt' would be to 'AAA.AAA'.
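That renaming scheme can be sketched in Python as follows; the helper name and the decision to keep the dots in place are assumptions based on the 'foo.txt' to 'AAA.AAA' example.

```python
import os
import string

def scrub_name_then_delete(path):
    """Illustrative only: rename a file 26 times ('AAA.AAA', 'BBB.BBB', ...),
    replacing every character of the name with one letter, then delete it."""
    directory, name = os.path.split(path)
    for letter in string.ascii_uppercase:            # A through Z: 26 renames
        new_name = "".join(c if c == "." else letter for c in name)
        new_path = os.path.join(directory, new_name)
        os.rename(path, new_path)
        path, name = new_path, new_name
    os.remove(path)

# Example (hypothetical file):
# scrub_name_then_delete(r"C:\temp\foo.txt")
```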
The reason that SDelete does not securely delete file names when cleaning disk free space is that deleting them would require direct manipulation of directory structures. Directory structures can have free space containing deleted file names, but the free directory space is not available for allocation to other files. Hence, SDelete has no way of allocating this free space so that it can securely overwrite it.
Download SDelete (151 KB)
Runs on:
- Client: Windows Vista and higher
- Server: Windows Server 2008 and higher
- Nano Server: 2016 and higher
Spark SQL, DataFrames and Datasets Guide

Overview

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL, including SQL, the DataFrames API and the Datasets API. When computing a result, the same execution engine is used, independent of which API/language you are using to express the computation. This unification means that developers can easily switch back and forth between the various APIs based on which provides the most natural way to express a given transformation. All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell, pyspark shell, or sparkR shell.
SQL

One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the Hive Tables section below. When running SQL from within another programming language the results will be returned as a DataFrame. You can also interact with the SQL interface using the command-line or over JDBC/ODBC.

DataFrames

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R.
Datasets

A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The unified Dataset API can be used in both Scala and Java.
Python does not yet have support for the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can access the field of a row by name naturally: row.columnName). Full Python support will be added in a future release.

Getting Started

Starting Point: SQLContext

The entry point into all functionality in Spark SQL is the SQLContext class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext.

In Scala:

    val sc: SparkContext // An existing SparkContext.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Create the DataFrame
    val df = sqlContext.read.json("examples/src/main/resources/people.json")

    // Show the content of the DataFrame
    df.show()
    // age  name
    // null Michael
    // 30   Andy
    // 19   Justin

    // Print the schema in a tree format
    df.printSchema()
    // root
    // |-- age: long (nullable = true)
    // |-- name: string (nullable = true)

    // Select only the "name" column
    df.select("name").show()
    // name
    // Michael
    // Andy
    // Justin

    // Select everybody, but increment the age by 1
    df.select(df("name"), df("age") + 1).show()
    // name    (age + 1)
    // Michael null
    // Andy    31
    // Justin  20

    // Select people older than 21
    df.filter(df("age") > 21).show()
    // age name
    // 30  Andy

    // Count people by age
    df.groupBy("age").count().show()
    // age  count
    // null 1
    // 19   1
    // 30   1

In Java:

    JavaSparkContext sc = ...; // An existing JavaSparkContext.
    SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

    // Create the DataFrame
    DataFrame df = sqlContext.read().json("examples/src/main/resources/people.json");

    // Show the content of the DataFrame
    df.show();

    // Print the schema in a tree format
    df.printSchema();
    // root
    // |-- age: long (nullable = true)
    // |-- name: string (nullable = true)

    // Select only the "name" column
    df.select("name").show();

    // Select everybody, but increment the age by 1
    df.select(df.col("name"), df.col("age").plus(1)).show();

    // Select people older than 21
    df.filter(df.col("age").gt(21)).show();

    // Count people by age
    df.groupBy("age").count().show();

In Python:

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)

    # Create the DataFrame
    df = sqlContext.read.json("examples/src/main/resources/people.json")

    # Show the content of the DataFrame
    df.show()

    # Print the schema in a tree format
    df.printSchema()

    # Select only the "name" column
    df.select("name").show()

    # Select everybody, but increment the age by 1
    df.select(df['name'], df['age'] + 1).show()

    # Select people older than 21
    df.filter(df['age'] > 21).show()

    # Count people by age
    df.groupBy("age").count().show()

In R:

    # Select people older than 21
    showDF(filter(df, df$age > 21))
    ## age name
    ## 30  Andy

    # Count people by age
    showDF(count(groupBy(df, "age")))
    ## age  count
    ## null 1
    ## 19   1
    ## 30   1

For a complete list of the types of operations that can be performed on a DataFrame refer to the API documentation. In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the DataFrame Function Reference.

Running SQL Queries Programmatically

The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.
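For instance, a minimal Python sketch; the table name assumes the people table registered in the earlier examples.

```python
# Run a SQL query against a previously registered temporary table;
# the result comes back as a DataFrame.
df = sqlContext.sql("SELECT name, age FROM people WHERE age IS NOT NULL")
df.show()
```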
Creating Datasets

In Scala:

    // Encoders for most common types are automatically provided by importing sqlContext.implicits._
    val ds = Seq(1, 2, 3).toDS()
    ds.map(_ + 1).collect() // Returns: Array(2, 3, 4)

    // Encoders are also created for case classes.
    case class Person(name: String, age: Long)
    val ds = Seq(Person("Andy", 32)).toDS()

    // DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name.
    val path = "examples/src/main/resources/people.json"
    val people = sqlContext.read.json(path).as[Person]

In Java:

    JavaSparkContext sc = ...; // An existing JavaSparkContext.
    SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

Interoperating with RDDs

Spark SQL supports two different methods for converting existing RDDs into DataFrames. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating DataFrames is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct DataFrames when the columns and their types are not known until runtime.

Inferring the Schema Using Reflection

In Scala:

    // sc is an existing SparkContext.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // this is used to implicitly convert an RDD to a DataFrame.
    import sqlContext.implicits._

    // Define the schema using a case class.
    // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
    // you can use custom classes that implement the Product interface.
    case class Person(name: String, age: Int)

    // Create an RDD of Person objects and register it as a table.
    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()
    people.registerTempTable("people")

    // SQL statements can be run by using the sql methods provided by sqlContext.
    val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

    // The columns of a row in the result can be accessed by field index:
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

    // or by field name:
    teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

    // row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
    teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
    // Map("name" -> "Justin", "age" -> 19)
In Python, Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table, and the types are inferred by looking at the first row. Since we currently only look at the first row, it is important that there is no missing data in the first row of the RDD. In future versions we plan to more completely infer the schema by looking at more data, similar to the inference that is performed on JSON files.

    # sc is an existing SparkContext.
    from pyspark.sql import SQLContext, Row
    sqlContext = SQLContext(sc)

    # Load a text file and convert each line to a Row.
    lines = sc.textFile("examples/src/main/resources/people.txt")
    parts = lines.map(lambda l: l.split(","))
    people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

    # Infer the schema, and register the DataFrame as a table.
    schemaPeople = sqlContext.createDataFrame(people)
    schemaPeople.registerTempTable("people")

    # SQL can be run over DataFrames that have been registered as a table.
    teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
Programmatically Specifying the Schema

In Scala: when case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps:

- Create an RDD of Rows from the original RDD;
- Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1;
- Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.

In Java: when JavaBean classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with the same three steps:

- Create an RDD of Rows from the original RDD;
- Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1;
- Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.

In Python: when a dictionary of kwargs cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps:

- Create an RDD of tuples or lists from the original RDD;
- Create the schema represented by a StructType matching the structure of tuples or lists in the RDD created in Step 1;
- Apply the schema to the RDD via the createDataFrame method provided by SQLContext.

For example:

    # Import SQLContext and data types
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *

    # sc is an existing SparkContext.
    sqlContext = SQLContext(sc)

    # Load a text file and convert each line to a tuple.
    lines = sc.textFile("examples/src/main/resources/people.txt")
    parts = lines.map(lambda l: l.split(","))
    people = parts.map(lambda p: (p[0], p[1].strip()))

    # The schema is encoded in a string.
    schemaString = "name age"
    fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
    schema = StructType(fields)

    # Apply the schema to the RDD.
    schemaPeople = sqlContext.createDataFrame(people, schema)

    # Register the DataFrame as a table.
    schemaPeople.registerTempTable("people")

    # SQL can be run over DataFrames that have been registered as a table.
    results = sqlContext.sql("SELECT name FROM people")

    # The results of SQL queries are RDDs and support all the normal RDD operations.
    names = results.map(lambda p: "Name: " + p.name)
    for name in names.collect():
        print(name)

Data Sources

Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on as normal RDDs and can also be registered as a temporary table. Registering a DataFrame as a table allows you to run SQL queries over its data. This section describes the general methods for loading and saving data using the Spark Data Sources and then goes into specific options that are available for the built-in data sources.

Generic Load/Save Functions

In the simplest form, the default data source (parquet unless otherwise configured by spark.sql.sources.default) will be used for all operations.
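As a hedged illustration of these generic calls in Python, the following sketch reads with the default source and then with an explicitly named json source; the paths and column names are illustrative.

```python
# Illustrative sketch of the generic load/save calls (paths and columns assumed).
# With no format given, the default data source (parquet) is used.
df = sqlContext.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

# A source can also be named explicitly, for example the json format.
df = sqlContext.read.load("examples/src/main/resources/people.json", format="json")
df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")
```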
Parquet Files

In Scala:

    // sqlContext from the previous example is used in this example.
    // This is used to implicitly convert an RDD to a DataFrame.
    import sqlContext.implicits._

    val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.

    // The RDD is implicitly converted to a DataFrame by implicits, allowing it to be stored using Parquet.
    people.write.parquet("people.parquet")

    // Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
    // The result of loading a Parquet file is also a DataFrame.
    val parquetFile = sqlContext.read.parquet("people.parquet")

    // Parquet files can also be registered as tables and then used in SQL statements.
    parquetFile.registerTempTable("parquetFile")
    val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

In Python:

    # sqlContext from the previous example is used in this example.
    schemaPeople  # The DataFrame from the previous example.

    # DataFrames can be saved as Parquet files, maintaining the schema information.
    schemaPeople.write.parquet("people.parquet")

    # Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
    # The result of loading a parquet file is also a DataFrame.
    parquetFile = sqlContext.read.parquet("people.parquet")

    # Parquet files can also be registered as tables and then used in SQL statements.
    parquetFile.registerTempTable("parquetFile")
    teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")

In SQL:

    CREATE TEMPORARY TABLE parquetTable
    USING org.apache.spark.sql.parquet
    OPTIONS (
      path "examples/src/main/resources/people.parquet"
    )

    SELECT * FROM parquetTable

Partition Discovery

Table partitioning is a common optimization approach used in systems like Hive.
In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory. The Parquet data source is now able to discover and infer partitioning information automatically. For example, we can store all our previously used population data into a partitioned table with two extra columns, gender and country, as partitioning columns. The schema of the resulting table is:

    root
    |-- name: string (nullable = true)
    |-- age: long (nullable = true)
    |-- gender: string (nullable = true)
    |-- country: string (nullable = true)

Notice that the data types of the partitioning columns are automatically inferred. Currently, numeric data types and string type are supported. Sometimes users may not want to automatically infer the data types of the partitioning columns. For these use cases, the automatic type inference can be configured by spark.sql.sources.partitionColumnTypeInference.enabled, which defaults to true. When type inference is disabled, string type will be used for the partitioning columns.

Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths by default. For the above example, if users pass path/to/table/gender=male to either SQLContext.read.parquet or SQLContext.read.load, gender will not be considered as a partitioning column. If users need to specify the base path that partition discovery should start with, they can set basePath in the data source options. For example, when path/to/table/gender=male is the path of the data and users set basePath to path/to/table/, gender will be a partitioning column.
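A short Python sketch of the basePath behavior described above; the paths are the hypothetical ones used in the text.

```python
# Reading the partition directory directly: "gender" is NOT treated as a
# partitioning column, because discovery starts at the given path.
df_single = sqlContext.read.parquet("path/to/table/gender=male")

# Supplying basePath makes discovery start at the table root, so "gender"
# becomes a partitioning column again.
df_partitioned = (sqlContext.read
                  .option("basePath", "path/to/table/")
                  .parquet("path/to/table/gender=male"))
```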
Schema Merging

Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by:

- setting the data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
- setting the global SQL option spark.sql.parquet.mergeSchema to true.
In Scala:

    // sqlContext from the previous example is used in this example.
    // This is used to implicitly convert an RDD to a DataFrame.
    import sqlContext.implicits._

    // Create a simple DataFrame, stored into a partition directory
    val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
    df1.write.parquet("data/testtable/key=1")

    // Create another DataFrame in a new partition directory,
    // adding a new column and dropping an existing column
    val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
    df2.write.parquet("data/testtable/key=2")

    // Read the partitioned table
    val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/testtable")
    df3.printSchema()

    // The final schema consists of all 3 columns in the Parquet files together
    // with the partitioning column appeared in the partition directory paths.
    // root
    // |-- single: int (nullable = true)
    // |-- double: int (nullable = true)
    // |-- triple: int (nullable = true)
    // |-- key: int (nullable = true)

In Python:

    # sqlContext from the previous example is used in this example.
    from pyspark.sql import Row

    # Create a simple DataFrame, stored into a partition directory
    df1 = sqlContext.createDataFrame(
        sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i * 2)))
    df1.write.parquet("data/testtable/key=1")

    # Create another DataFrame in a new partition directory,
    # adding a new column and dropping an existing column
    df2 = sqlContext.createDataFrame(
        sc.parallelize(range(6, 11)).map(lambda i: Row(single=i, triple=i * 3)))
    df2.write.parquet("data/testtable/key=2")

    # Read the partitioned table
    df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/testtable")
    df3.printSchema()

    # The final schema consists of all 3 columns in the Parquet files together
    # with the partitioning column appeared in the partition directory paths.
    # root
    # |-- single: int (nullable = true)
    # |-- double: int (nullable = true)
    # |-- triple: int (nullable = true)
    # |-- key: int (nullable = true)

In R:

    # sqlContext from the previous example is used in this example.
    # Create a simple DataFrame, stored into a partition directory
    saveDF(df1, "data/testtable/key=1", "parquet", "overwrite")

    # Create another DataFrame in a new partition directory,
    # adding a new column and dropping an existing column
    saveDF(df2, "data/testtable/key=2", "parquet", "overwrite")

    # Read the partitioned table
Spark SQL caches Parquet table metadata for better performance; if the underlying files are changed outside of Spark SQL, the cached metadata can be refreshed from SQL:

    REFRESH TABLE mytable;

Configuration

Configuration of Parquet can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.

Property Name | Default | Meaning |
---|---|---|
spark.sql.parquet.binaryAsString | false | Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems. |
spark.sql.parquet.int96AsTimestamp | true | Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems. |
spark.sql.parquet.cacheMetadata | true | Turns on caching of Parquet schema metadata. Can speed up querying of static data. |
spark.sql.parquet.compression.codec | gzip | Sets the compression codec used when writing Parquet files. Acceptable values include: uncompressed, snappy, gzip, lzo. |
spark.sql.parquet.filterPushdown | true | Enables Parquet filter push-down optimization when set to true. |
spark.sql.hive.convertMetastoreParquet | true | When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built-in support. |
spark.sql.parquet.output.committer.class | org.apache.parquet.hadoop.ParquetOutputCommitter | The output committer class used by Parquet. The specified class needs to be a subclass of org.apache.hadoop.mapreduce.OutputCommitter. |
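For illustration, a hedged Python sketch of the two configuration mechanisms just mentioned; the specific option values are arbitrary.

```python
# Set a Parquet option programmatically on the SQLContext...
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

# ...or with a SQL SET command.
sqlContext.sql("SET spark.sql.parquet.compression.codec=snappy")
```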
JSON Datasets

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. A JSON table can also be created directly in SQL:

    CREATE TEMPORARY TABLE jsonTable
    USING org.apache.spark.sql.json
    OPTIONS (
      path "examples/src/main/resources/people.json"
    )

    SELECT * FROM jsonTable

Hive Tables

Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, it is not included in the default Spark assembly. Hive support is enabled by adding the -Phive and -Phive-thriftserver flags to Spark's build. This command builds a new assembly jar that includes Hive.
Note that this Hive assembly jar must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/. Please note that when running the query on a YARN cluster (cluster mode), the datanucleus jars under the lib_managed/jars directory and hive-site.xml under the conf/ directory need to be available on the driver and all executors launched by the YARN cluster. The convenient way to do this is adding them through the --jars option and --files option of the spark-submit command.
When working with Hive one must construct a HiveContext, which inherits from SQLContext, and adds support for finding tables in the MetaStore and writing queries using HiveQL. Users who do not have an existing Hive deployment can still create a HiveContext. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a warehouse directory indicated by HiveConf, which defaults to /user/hive/warehouse. Note that you may need to grant write privilege on /user/hive/warehouse to the user who starts the spark application.

    # sc is an existing SparkContext.
    from pyspark.sql import HiveContext
    sqlContext = HiveContext(sc)

JDBC To Other Databases

To get started you will need to include the JDBC driver for your particular database on the spark classpath. For example, to connect to postgres from the Spark Shell you would run the following command:

    SPARK_CLASSPATH=postgresql-9.3-1102-jdbc41.jar bin/spark-shell

Tables from the remote database can be loaded as a DataFrame or Spark SQL temporary table using the Data Sources API. The following options are supported:

Property Name | Meaning |
---|---|
url | The JDBC URL to connect to. |
dbtable | The JDBC table that should be read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses. |
driver | The class name of the JDBC driver needed to connect to this URL. This class will be loaded on the master and workers before running any JDBC commands to allow the driver to register itself with the JDBC subsystem. |
partitionColumn, lowerBound, upperBound, numPartitions | These options must all be specified if any of them is specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. So all rows in the table will be partitioned and returned. |
fetchSize | The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to a low fetch size (e.g. Oracle with 10 rows). |
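As a hedged illustration of loading such a table from Python with the options above: the URL and table name reuse the placeholders from the SQL example below, and the driver class name is an assumption for a PostgreSQL source.

```python
# Illustrative sketch: load a remote table over JDBC as a DataFrame.
df = (sqlContext.read.format("jdbc")
      .options(url="jdbc:postgresql:dbserver",
               dbtable="schema.tablename",
               driver="org.postgresql.Driver")   # assumed driver class
      .load())
```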
In SQL:

    CREATE TEMPORARY TABLE jdbcTable
    USING org.apache.spark.sql.jdbc
    OPTIONS (
      url "jdbc:postgresql:dbserver",
      dbtable "schema.tablename"
    )

Troubleshooting

- The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java's DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.
- Some databases, such as H2, convert all names to upper case. You'll need to use upper case to refer to those names in Spark SQL.

Performance Tuning

For some workloads it is possible to improve performance by either caching data in memory, or by turning on some experimental options.

Caching Data In Memory

Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.
Configuration of in-memory caching can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.

Property Name | Default | Meaning |
---|---|---|
spark.sql.inMemoryColumnarStorage.compressed | true | When set to true Spark SQL will automatically select a compression codec for each column based on statistics of the data. |
spark.sql.inMemoryColumnarStorage.batchSize | 10000 | Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OOMs when caching data. |

Other Configuration Options

The following options can also be used to tune the performance of query execution. It is possible that these options will be deprecated in a future release as more optimizations are performed automatically.

Property Name | Default | Meaning |
---|---|---|
spark.sql.autoBroadcastJoinThreshold | 10485760 (10 MB) | Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. |

Running the Thrift JDBC/ODBC Server

To start the JDBC/ODBC server, run the following in the Spark directory:

    ./sbin/start-thriftserver.sh \
      --hiveconf hive.server2.thrift.port=<listening-port> \
      --hiveconf hive.server2.thrift.bind.host=<listening-host> \
      --master <master-uri>

Now you can use beeline to test the Thrift JDBC/ODBC server:

    ./bin/beeline

Connect to the JDBC/ODBC server in beeline with:

    beeline> !connect jdbc:hive2://localhost:10000

Beeline will ask you for a username and password. In non-secure mode, simply enter the username on your machine and a blank password.

To run the Thrift server in single-session mode:

    ./sbin/start-thriftserver.sh --conf spark.sql.hive.thriftServer.singleSession=true

Upgrading From Spark SQL 1.4 to 1.5

- Optimized execution using manually managed memory (Tungsten) is now enabled by default, along with code generation for expression evaluation. These features can both be disabled by setting spark.sql.tungsten.enabled to false.
- Parquet schema merging is no longer enabled by default.
It can be re-enabled by setting spark.sql.parquet.mergeSchema to true.
- Resolution of strings to columns in Python now supports using dots (.) to qualify the column or access nested values, for example df['table.column.nestedField']. However, this means that if your column name contains any dots you must now escape them using backticks (e.g., table.`column.with.dots`.nested).
- In-memory columnar storage partition pruning is on by default. It can be disabled by setting spark.sql.inMemoryColumnarStorage.partitionPruning to false.
- Unlimited precision decimal columns are no longer supported, instead Spark SQL enforces a maximum precision of 38. When inferring schema from BigDecimal objects, a precision of (38, 18) is now used. When no precision is specified in DDL then the default remains Decimal(10, 0).
- Timestamps are now stored at a precision of 1us, rather than 1ns.
- In the sql dialect, floating point numbers are now parsed as decimal. HiveQL parsing remains unchanged.
- The canonical names of SQL/DataFrame functions are now lower case (e.g. sum vs SUM).
- It has been determined that using the DirectOutputCommitter when speculation is enabled is unsafe and thus this output committer will not be used when speculation is on, independent of configuration.
- The JSON data source will not automatically load new files that are created by other applications (i.e. files that are not inserted to the dataset through Spark SQL). For a JSON persistent table (i.e. the metadata of the table is stored in Hive Metastore), users can use the REFRESH TABLE SQL command or HiveContext's refreshTable method to include those new files to the table. For a DataFrame representing a JSON dataset, users need to recreate the DataFrame and the new DataFrame will include new files.

Upgrading from Spark SQL 1.3 to 1.4

DataFrame data reader/writer interface

Based on user feedback, we created a new, more fluid API for reading data in (SQLContext.read) and writing data out (DataFrame.write), and deprecated the old APIs (e.g. SQLContext.parquetFile, SQLContext.jsonFile). See the API docs for SQLContext.read and DataFrame.write for more information.
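As a hedged side-by-side illustration of this change in Python (the paths are illustrative, and the deprecated 1.3-style calls are shown only in comments):

```python
# 1.4+ reader/writer interface
df = sqlContext.read.json("examples/src/main/resources/people.json")
df.write.parquet("people.parquet")

# Deprecated 1.3-style equivalents:
#   df = sqlContext.jsonFile("examples/src/main/resources/people.json")
#   df.saveAsParquetFile("people.parquet")
```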
DataFrame.groupBy retains grouping columns

Based on user feedback, we changed the default behavior of DataFrame.groupBy().agg() to retain the grouping columns in the resulting DataFrame. To keep the behavior in 1.3, set spark.sql.retainGroupColumns to false.

In Scala/Java:

    // In 1.3.x, in order for the grouping column "department" to show up,
    // it must be included explicitly as part of the agg function call.
    df.groupBy("department").agg(col("department"), max("age"), sum("expense"));

    // In 1.4+, grouping column "department" is included automatically.
    df.groupBy("department").agg(max("age"), sum("expense"));

    // Revert to 1.3 behavior (not retaining grouping column) by:
    sqlContext.setConf("spark.sql.retainGroupColumns", "false");

In Python:

    import pyspark.sql.functions as func

    # In 1.3.x, in order for the grouping column "department" to show up,
    # it must be included explicitly as part of the agg function call.
    df.groupBy("department").agg(df["department"], func.max("age"), func.sum("expense"))

    # In 1.4+, grouping column "department" is included automatically.
    df.groupBy("department").agg(func.max("age"), func.sum("expense"))

    # Revert to 1.3.x behavior (not retaining grouping column) by:
    sqlContext.setConf("spark.sql.retainGroupColumns", "false")

Upgrading from Spark SQL 1.0-1.2 to 1.3

In Spark 1.3 we removed the "Alpha" label from Spark SQL and as part of this did a cleanup of the available APIs. From Spark 1.3 onwards, Spark SQL will provide binary compatibility with other releases in the 1.X series. This compatibility guarantee excludes APIs that are explicitly marked as unstable (i.e., DeveloperAPI or Experimental).

Rename of SchemaRDD to DataFrame

The largest change that users will notice when upgrading to Spark SQL 1.3 is that SchemaRDD has been renamed to DataFrame.
This is primarily because DataFrames no longer inherit from RDD directly, but instead provide most of the functionality that RDDs provide through their own implementation. DataFrames can still be converted to RDDs by calling the .rdd method.

In Scala there is a type alias from SchemaRDD to DataFrame to provide source compatibility for some use cases. It is still recommended that users update their code to use DataFrame instead. Java and Python users will need to update their code.

Unification of the Java and Scala APIs

Prior to Spark 1.3 there were separate Java compatible classes (JavaSQLContext and JavaSchemaRDD) that mirrored the Scala API. In Spark 1.3 the Java API and Scala API have been unified. Users of either language should use SQLContext and DataFrame. In general these classes try to use types that are usable from both languages (i.e. Array instead of language-specific collections). In some cases where no common type exists (e.g., for passing in closures or Maps) function overloading is used instead. Additionally, the Java specific types API has been removed. Users of both Scala and Java should use the classes present in org.apache.spark.sql.types to describe schema programmatically.

Isolation of Implicit Conversions and Removal of dsl Package (Scala-only)

Many of the code examples prior to Spark 1.3 started with import sqlContext._, which brought all of the functions from sqlContext into scope. In Spark 1.3 we have isolated the implicit conversions for converting RDDs into DataFrames into an object inside of the SQLContext. Users should now write import sqlContext.implicits._. Additionally, the implicit conversions now only augment RDDs that are composed of Products (i.e., case classes or tuples) with a method toDF, instead of applying automatically. When using functions inside of the DSL (now replaced with the DataFrame API) users used to import org.apache.spark.sql.catalyst.dsl. Instead the public dataframe functions API should be used: import org.apache.spark.sql.functions._.

Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only)

Spark 1.3 removes the type aliases that were present in the base sql package for DataType. Users should instead import the classes in org.apache.spark.sql.types.

UDF Registration Moved to sqlContext.udf (Java & Scala)

Functions that are used to register UDFs, either for use in the DataFrame DSL or SQL, have been moved into the udf object in SQLContext.

All data types of Spark SQL are located in the package org.apache.spark.sql.types. To access or create a data type, please use factory methods provided in org.apache.spark.sql.types.DataTypes.
Data type | Value type in R | API to access or create a data type |
---|---|---|
ByteType | integer. Note: Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127. | "byte" |
ShortType | integer. Note: Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767. | "short" |
IntegerType | integer | "integer" |
LongType | integer. Note: Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, please convert data to decimal.Decimal and use DecimalType. | "long" |
FloatType | numeric. Note: Numbers will be converted to 4-byte single-precision floating point numbers at runtime. | "float" |
DoubleType | numeric | "double" |
DecimalType | Not supported | Not supported |
StringType | character | "string" |
BinaryType | raw | "binary" |
BooleanType | logical | "bool" |
TimestampType | POSIXct | "timestamp" |
DateType | Date | "date" |
ArrayType | vector or list | list(type="array", elementType=elementType, containsNull=containsNull). Note: The default value of containsNull is True. |
MapType | environment | list(type="map", keyType=keyType, valueType=valueType, valueContainsNull=valueContainsNull). Note: The default value of valueContainsNull is True. |
StructType | named list | list(type="struct", fields=fields). Note: fields is a Seq of StructFields. |