Hive: Struct

This tutorial will show you how to use structs. If you have not installed Hive yet, please follow this tutorial.

Create Table with Struct:

  CREATE TABLE test_struct (
    columnA STRING,
    columnB VARCHAR(15),
    columnC INT,
    columnD TIMESTAMP,
    columnE DATE,
    columnF STRUCT<key:STRING, value:INT>
  )
  STORED AS ORC;

Insert Data:

  INSERT INTO test_struct
  SELECT '1', '2', 1, '2019-02-07 20:58:27', '2019-02-07', NAMED_STRUCT('key', 'val', 'value', 1);

Select Data:
This will return the values of “key” and “value” from columnF.

  SELECT columnF.key, columnF.value
  FROM test_struct;
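
You can also select the struct column as a whole; Hive will typically render it as a JSON-like value such as {"key":"val","value":1}.

  SELECT columnF
  FROM test_struct;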

Hive: Map

This tutorial will show you how to use maps. If you have not installed Hive yet, please follow this tutorial.

Create Table with Map:

  CREATE TABLE test_map (
    columnA STRING,
    columnB VARCHAR(15),
    columnC INT,
    columnD TIMESTAMP,
    columnE DATE,
    columnF MAP<STRING, INT>
  )
  STORED AS ORC;

Insert Data:

  INSERT INTO test_map
  SELECT '1', '2', 1, '2019-02-07 20:58:27', '2019-02-07', MAP('Val', 1);

Select Data:
This will return the value stored under the key “Val” in columnF.

  SELECT columnF['Val']
  FROM test_map;
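
If you want every key or value rather than a single lookup, Hive’s MAP_KEYS and MAP_VALUES functions return them as arrays.

  SELECT MAP_KEYS(columnF), MAP_VALUES(columnF)
  FROM test_map;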

Hive: Misc

This tutorial will show you some miscellaneous functions and usage for working with Hive. If you have not installed Hive yet, please follow this tutorial.

NVL:
Checks whether a value is null and, if it is, substitutes another value.

  SELECT NVL(columnA, 'was null')
  FROM test;
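
If you need more than one fallback, Hive also supports COALESCE, which returns the first non-null argument.

  SELECT COALESCE(columnA, columnB, 'both were null')
  FROM test;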

CAST:
Converts a value from one type to another; for example, if columnE is a DATE and you want it as a STRING.

  SELECT CAST(columnE AS STRING)
  FROM test;

CONCAT:
This will concatenate the strings together, giving “Test the code!”

  SELECT CONCAT('Test', ' the ', 'code!');

CONCAT_WS:
This will concatenate the strings starting at “Test”, using the first argument as the separator, giving “Test the code”.

  SELECT CONCAT_WS(' ', 'Test', 'the', 'code');

MIN:
This will give the minimum value of columnC.

  SELECT MIN(columnC) AS Min_ColumnC
  FROM test;

MAX:
This will give the maximum value of columnC.

  SELECT MAX(columnC) AS Max_ColumnC
  FROM test;

DISTINCT:
This will select the distinct combinations of columnA and columnB.

  SELECT DISTINCT columnA, columnB
  FROM test;

CASE:
This will return “Is Val” when columnA equals “val” and “Not Val” otherwise.

  SELECT CASE WHEN columnA = 'val' THEN 'Is Val' ELSE 'Not Val' END
  FROM test;

COLLECT_SET:
This will collect the distinct values of columnC into an array and select the element at index 0.

  SELECT COLLECT_SET(columnC)[0]
  FROM test;
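
COLLECT_SET is most useful combined with GROUP BY to gather the distinct values per group; a minimal sketch against the same test table:

  SELECT columnA, COLLECT_SET(columnC)
  FROM test
  GROUP BY columnA;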

CURRENT_DATE:
This will give you the current date in the format yyyy-MM-dd.

  SELECT CURRENT_DATE();

UNIX_TIMESTAMP:
This will give you the current time in seconds since the Unix epoch, e.g. 1549591492.

  SELECT UNIX_TIMESTAMP();

FROM_UNIXTIME:
This will take an epoch timestamp and display it in the format yyyy-MM-dd HH:mm:ss.

  SELECT FROM_UNIXTIME(1549591492);
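
FROM_UNIXTIME also accepts a second argument with a Java SimpleDateFormat pattern if you want a different output format.

  SELECT FROM_UNIXTIME(1549591492, 'yyyy-MM-dd');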

TO_DATE:
This will extract the date portion of a timestamp string, giving yyyy-MM-dd.

  SELECT TO_DATE('2019-02-01 01:01:01');

YEAR:

  SELECT YEAR(columnE)
  FROM test;

MONTH:

  SELECT MONTH(columnE)
  FROM test;

DAY:

  SELECT DAY(columnE)
  FROM test;

DATE_ADD:
This will add 4 days to the date, giving 2019-01-05.

  SELECT DATE_ADD('2019-01-01', 4);
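
Hive likewise provides DATE_SUB to subtract days and DATEDIFF to count the days between two dates.

  SELECT DATE_SUB('2019-01-01', 4);
  SELECT DATEDIFF('2019-01-05', '2019-01-01');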

UPPER:
This will uppercase columnA.

  SELECT UPPER(columnA)
  FROM test;

LOWER:
This will lowercase columnA.

  SELECT LOWER(columnA)
  FROM test;

TRIM:
This will trim leading and trailing spaces in columnA.

  SELECT TRIM(columnA)
  FROM test;

LITERALS:
If a column name contains a space, you will need to quote it with backticks in order to use the AS keyword or when you are defining it in the CREATE TABLE command. Although I don’t recommend this, it is possible.

  SELECT columnA AS `test column`
  FROM test;
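
The same backtick quoting works at table-creation time; a minimal sketch (test_literal is a hypothetical table name):

  CREATE TABLE test_literal (
    `test column` STRING
  )
  STORED AS ORC;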

Hive: Tables

This tutorial will show you some common commands for working with tables. If you have not installed Hive yet, please follow this tutorial.

Show Tables:

  SHOW TABLES;
  SHOW TABLES LIKE '*test*';

Table Creation:

  CREATE TABLE test (
    columnA STRING,
    columnB VARCHAR(15),
    columnC INT,
    columnD TIMESTAMP,
    columnE DATE
  )
  STORED AS ORC;
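
With the table in place you can insert a row the same way as in the struct and map tutorials above; a minimal sketch with made-up values:

  INSERT INTO test
  SELECT 'a', 'b', 1, '2019-02-07 20:58:27', '2019-02-07';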

Table Creation with Partitioning:

  CREATE TABLE test_partition (
    columnA STRING,
    columnB VARCHAR(15),
    columnC INT,
    columnD TIMESTAMP,
    columnE DATE
  )
  PARTITIONED BY (columnF INT)
  STORED AS ORC;
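
When inserting into a partitioned table, the partition column is named separately from the regular columns; a minimal sketch with a static partition value:

  INSERT INTO test_partition PARTITION (columnF = 1)
  SELECT 'a', 'b', 1, '2019-02-07 20:58:27', '2019-02-07';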

Inline Table Creation:

  CREATE TABLE test_inline STORED AS ORC AS
  SELECT *
  FROM test;

Temporary Table Creation:

  CREATE TEMPORARY TABLE temp (
    columnA STRING,
    columnB VARCHAR(15),
    columnC INT,
    columnD TIMESTAMP,
    columnE DATE
  )
  STORED AS ORC;

DESC Table:
This will show you the basic definition of a table.

  DESC test;

DESC EXTENDED Table:
This will show you the extended definition of a table.

  DESC EXTENDED test;
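
There is also DESC FORMATTED, which shows the same details in a more readable layout.

  DESC FORMATTED test;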

Drop Table:

  DROP TABLE IF EXISTS temp;

Hive & Java: Connect to Remote Kerberos Hive using KeyTab

In this tutorial I will show you how to connect to a remote Kerberos Hive cluster using Java. If you haven’t installed Hive yet, follow this tutorial.

Import SSL Cert to Java:

Follow this tutorial on installing the unlimited strength encryption Java libraries.

If on Windows, do the following:

  #Import it
  "C:\Program Files\Java\jdk1.8.0_171\bin\keytool" -import -file hadoop.csr -keystore "C:\Program Files\Java\jdk1.8.0_171\jre\lib\security\cacerts" -alias "hadoop"

  #Check it
  "C:\Program Files\Java\jdk1.8.0_171\bin\keytool" -list -v -keystore "C:\Program Files\Java\jdk1.8.0_171\jre\lib\security\cacerts"

  #If you want to delete it
  "C:\Program Files\Java\jdk1.8.0_171\bin\keytool" -delete -alias hadoop -keystore "C:\Program Files\Java\jdk1.8.0_171\jre\lib\security\cacerts"

POM.xml:

  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>2.3.3</version>
    <exclusions>
      <exclusion>
        <groupId>jdk.tools</groupId>
        <artifactId>jdk.tools</artifactId>
      </exclusion>
    </exclusions>
  </dependency>

Imports:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.security.UserGroupInformation;

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.SQLException;
  import java.sql.Statement;

Connect:

  // Setup the configuration object.
  final Configuration config = new Configuration();

  config.set("fs.defaultFS", "swebhdfs://hadoop:50470");
  config.set("hadoop.security.authentication", "kerberos");
  config.set("hadoop.rpc.protection", "integrity");

  // Kerberos and SSL related JVM settings.
  System.setProperty("https.protocols", "TLSv1,TLSv1.1,TLSv1.2");
  System.setProperty("java.security.krb5.conf", "C:\\Program Files\\Java\\jdk1.8.0_171\\jre\\lib\\security\\krb5.conf");
  System.setProperty("java.security.krb5.realm", "REALM.CA");
  System.setProperty("java.security.krb5.kdc", "REALM.CA");
  System.setProperty("sun.security.krb5.debug", "true");
  System.setProperty("javax.net.debug", "all");
  System.setProperty("javax.net.ssl.keyStorePassword", "changeit");
  System.setProperty("javax.net.ssl.keyStore", "C:\\Program Files\\Java\\jdk1.8.0_171\\jre\\lib\\security\\cacerts");
  System.setProperty("javax.net.ssl.trustStore", "C:\\Program Files\\Java\\jdk1.8.0_171\\jre\\lib\\security\\cacerts");
  System.setProperty("javax.net.ssl.trustStorePassword", "changeit");
  System.setProperty("javax.security.auth.useSubjectCredsOnly", "false");

  // Login from the keytab.
  UserGroupInformation.setConfiguration(config);
  UserGroupInformation.setLoginUser(UserGroupInformation.loginUserFromKeytabAndReturnUGI("hive/hadoop@REALM.CA", "c:\\data\\hive.service.keytab"));

  System.out.println(UserGroupInformation.getLoginUser());
  System.out.println(UserGroupInformation.getCurrentUser());

  //Add the hive driver
  Class.forName("org.apache.hive.jdbc.HiveDriver");

  //Connect to hive jdbc
  Connection connection = DriverManager.getConnection("jdbc:hive2://hadoop:10000/default;principal=hive/hadoop@REALM.CA");
  Statement statement = connection.createStatement();

  //Create a table
  String createTableSql = "CREATE TABLE IF NOT EXISTS"
      + " employee ( eid int, name String,"
      + " salary String, designation String)"
      + " COMMENT 'Employee details'"
      + " ROW FORMAT DELIMITED"
      + " FIELDS TERMINATED BY '\t'"
      + " LINES TERMINATED BY '\n'"
      + " STORED AS TEXTFILE";

  System.out.println("Creating Table: " + createTableSql);
  statement.executeUpdate(createTableSql);

  //Show all the tables to ensure we successfully added the table
  String showTablesSql = "show tables";
  System.out.println("Show All Tables: " + showTablesSql);
  ResultSet res = statement.executeQuery(showTablesSql);

  while (res.next()) {
    System.out.println(res.getString(1));
  }

  //Drop the table
  String dropTablesSql = "DROP TABLE IF EXISTS employee";

  System.out.println("Dropping Table: " + dropTablesSql);
  statement.executeUpdate(dropTablesSql);

  //Close the JDBC resources when finished.
  res.close();
  statement.close();
  connection.close();

  System.out.println("Finish!");

Hive Kerberos Installation

We are going to install Hive over Hadoop and perform a basic query. Ensure you install Kerberos and Hadoop with Kerberos.

This assumes your hostname is “hadoop”.

Download Hive:

  wget http://apache.forsale.plus/hive/hive-2.3.3/apache-hive-2.3.3-bin.tar.gz
  tar -xzf apache-hive-2.3.3-bin.tar.gz
  sudo mv apache-hive-2.3.3-bin /usr/local/hive
  sudo chown -R root:hadoopuser /usr/local/hive/

Setup .bashrc:

  sudo nano ~/.bashrc

Add the following to the end of the file.

  #HIVE VARIABLES START
  export HIVE_HOME=/usr/local/hive
  export HIVE_CONF_DIR=/usr/local/hive/conf
  export PATH=$PATH:$HIVE_HOME/bin
  export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:/usr/local/hive/lib/*
  #HIVE VARIABLES STOP

  source ~/.bashrc

Create warehouse on HDFS

  kinit -kt /etc/security/keytabs/myuser.keytab myuser/hadoop@REALM.CA
  hdfs dfs -mkdir -p /user/hive/warehouse
  hdfs dfs -mkdir /tmp
  hdfs dfs -chmod g+w /tmp
  hdfs dfs -chmod g+w /user/hive/warehouse

Create Kerberos Principals

  cd /etc/security/keytabs
  sudo kadmin.local
  addprinc -randkey hive/hadoop@REALM.CA
  addprinc -randkey hivemetastore/hadoop@REALM.CA
  addprinc -randkey hive-spnego/hadoop@REALM.CA
  xst -kt hive.service.keytab hive/hadoop@REALM.CA
  xst -kt hivemetastore.service.keytab hivemetastore/hadoop@REALM.CA
  xst -kt hive-spnego.service.keytab hive-spnego/hadoop@REALM.CA
  q

Set Keytab Permissions/Ownership

  sudo chown root:hadoopuser /etc/security/keytabs/*
  sudo chmod 750 /etc/security/keytabs/*

hive-env.sh

  cd $HIVE_HOME/conf
  sudo cp hive-env.sh.template hive-env.sh

  sudo nano /usr/local/hive/conf/hive-env.sh

  #locate "HADOOP_HOME" and change to be this
  export HADOOP_HOME=/usr/local/hadoop

  #locate "HIVE_CONF_DIR" and change to be this
  export HIVE_CONF_DIR=/usr/local/hive/conf

hive-site.xml

Check out this link for the configuration properties.

  sudo cp /usr/local/hive/conf/hive-default.xml.template /usr/local/hive/conf/hive-site.xml

  sudo nano /usr/local/hive/conf/hive-site.xml

  #Modify the following properties

  <property>
    <name>system:user.name</name>
    <value>${user.name}</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://myhost:5432/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>PASSWORD</value>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/tmp/${system:user.name}</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/tmp/${system:user.name}</value>
    <description>Location of Hive run time structured log file</description>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/tmp/${hive.session.id}_resources</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
  </property>
  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/tmp/${system:user.name}/operation_logs</value>
    <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://0.0.0.0:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
  <property>
    <name>hive.server2.webui.host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>hive.server2.webui.port</name>
    <value>10002</value>
  </property>
  <property>
    <name>hive.metastore.port</name>
    <value>9083</value>
  </property>
  <property>
    <name>hive.server2.transport.mode</name>
    <value>binary</value>
  </property>
  <property>
    <name>hive.server2.thrift.sasl.qop</name>
    <value>auth-int</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>KERBEROS</value>
    <description>authentication type</description>
  </property>
  <property>
    <name>hive.server2.authentication.kerberos.principal</name>
    <value>hive/_HOST@REALM.CA</value>
    <description>HiveServer2 principal. If _HOST is used as the FQDN portion, it will be replaced with the actual hostname of the running instance.</description>
  </property>
  <property>
    <name>hive.server2.authentication.kerberos.keytab</name>
    <value>/etc/security/keytabs/hive.service.keytab</value>
    <description>Keytab file for HiveServer2 principal</description>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>true</value>
    <description>If true, the metastore thrift interface will be secured with SASL. Clients must authenticate with Kerberos.</description>
  </property>
  <property>
    <name>hive.metastore.kerberos.keytab.file</name>
    <value>/etc/security/keytabs/hivemetastore.service.keytab</value>
    <description>The path to the Kerberos Keytab file containing the metastore thrift server's service principal.</description>
  </property>
  <property>
    <name>hive.metastore.kerberos.principal</name>
    <value>hivemetastore/_HOST@REALM.CA</value>
    <description>The service principal for the metastore thrift server. The special string _HOST will be replaced automatically with the correct host name.</description>
  </property>
  <property>
    <name>hive.security.authorization.enabled</name>
    <value>true</value>
    <description>enable or disable the hive client authorization</description>
  </property>
  <property>
    <name>hive.metastore.pre.event.listeners</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
    <description>List of comma separated listeners for metastore events.</description>
  </property>
  <property>
    <name>hive.security.metastore.authorization.manager</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
    <description>
      Names of authorization manager classes (comma separated) to be used in the metastore
      for authorization. The user defined authorization class should implement interface
      org.apache.hadoop.hive.ql.security.authorization.HiveMetastoreAuthorizationProvider.
      All authorization manager classes have to successfully authorize the metastore API
      call for the command execution to be allowed.
    </description>
  </property>
  <property>
    <name>hive.security.metastore.authenticator.manager</name>
    <value>org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator</value>
    <description>
      authenticator manager class name to be used in the metastore for authentication.
      The user defined authenticator should implement interface org.apache.hadoop.hive.ql.security.HiveAuthenticationProvider.
    </description>
  </property>
  <property>
    <name>hive.security.metastore.authorization.auth.reads</name>
    <value>true</value>
    <description>If this is true, metastore authorizer authorizes read actions on database, table</description>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
  </property>

Hadoop core-site.xml

Notice here how it’s the “hive” user that appears in the proxyuser property names when using storage based authentication.

  sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

  <property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>*</value>
  </property>

Install Postgres 9.6

Follow this install for installing PostgreSQL 9.6.

  sudo su - postgres
  psql

  CREATE USER hiveuser WITH PASSWORD 'PASSWORD';
  CREATE DATABASE metastore;
  GRANT ALL PRIVILEGES ON DATABASE metastore TO hiveuser;
  \q
  exit

Initialize Postgres Schema

  schematool -dbType postgres -initSchema
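
If you want to sanity-check the result, you can connect to the metastore database with psql and query the schema version table; a minimal sketch, assuming the default Postgres metastore layout where table names are created quoted and uppercase:

  SELECT * FROM "VERSION";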

Start Metastore & HiveServer2

  nohup /usr/local/hive/bin/hive --service metastore --hiveconf hive.log.file=hivemetastore.log >/var/log/hive/hivemetastore.out 2>/var/log/hive/hivemetastoreerr.log &

  nohup /usr/local/hive/bin/hiveserver2 --hiveconf hive.metastore.uris=" " --hiveconf hive.log.file=hiveserver2.log >/var/log/hive/hiveserver2.out 2>/var/log/hive/hiveserver2err.log &

Auto Start

  sudo mkdir /var/log/hive/
  sudo chown root:hadoopuser /var/log/hive
  sudo chmod 777 /var/log/hive

  crontab -e

  #Add the following
  @reboot nohup /usr/local/hive/bin/hive --service metastore --hiveconf hive.log.file=hivemetastore.log >/var/log/hive/hivemetastore.out 2>/var/log/hive/hivemetastoreerr.log &
  @reboot nohup /usr/local/hive/bin/hiveserver2 --hiveconf hive.metastore.uris=" " --hiveconf hive.log.file=hiveserver2.log >/var/log/hive/hiveserver2.out 2>/var/log/hive/hiveserver2err.log &

Now you can check the Hive version:

  hive --version

Hive Web URL

http://hadoop:10002/

Beeline

  #We first need to have a ticket to access beeline using the hive kerberos user we setup earlier.
  kinit -kt /etc/security/keytabs/hive.service.keytab hive/hadoop@REALM.CA

  #Now we can get into beeline using that principal
  beeline -u "jdbc:hive2://0.0.0.0:10000/default;principal=hive/hadoop@REALM.CA;"

  #You can also just get into beeline then connect from there
  beeline
  beeline>!connect jdbc:hive2://0.0.0.0:10000/default;principal=hive/hadoop@REALM.CA

  #Disconnect from beeline
  !q
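
Once connected, a quick smoke test confirms HiveServer2 is answering queries:

  SHOW DATABASES;
  SELECT CURRENT_DATE();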

References

http://www.bogotobogo.com/Hadoop/BigData_hadoop_Hive_Install_On_Ubuntu_16_04.php
https://maprdocs.mapr.com/home/Hive/Config-RemotePostgreSQLForHiveMetastore.html
https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool#HiveSchemaTool-TheHiveSchemaTool
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-InstallationandConfiguration