Beginner’s guide to maintaining a Hive schema with external data storage and executing Hive queries using Python
Dec 30, 2019
In this article, I'm going to share my experience of maintaining a Hive schema. It will be useful to newcomers who want to step into Big Data technologies. Mainly, it describes how to connect to Hive using Python and how to use AWS S3 as the data storage. If you are not familiar with Hive concepts, please go through the Hive article on Wikipedia first.
The main objective of this article is to provide a guide for connecting to Hive through Python and executing queries. I'm using the "PyHive" library for that. I'm creating my connection class as "HiveConnection", and Hive queries will be passed into its functions. AWS S3 will be used as the file storage for the Hive tables.
import pandas as pd
from pyhive import hive


class HiveConnection:
    @staticmethod
    def select_query(query_str: str, database: str = HIVE_SCHEMA) -> pd.DataFrame:
        """
        Execute a SELECT query which returns a result set.
        :param query_str: select query to be executed
        :param database: Hive schema
        :return: result set as a pandas DataFrame
        """
        conn = hive.Connection(host=HIVE_URL, port=HIVE_PORT, database=database, username=HIVE_USER)
        try:
            result = pd.read_sql(query_str, conn)
            return result
        finally:
            conn.close()

    @staticmethod
    def execute_query(query_str: str, database: str = HIVE_SCHEMA):
        """
        Execute a query which does not return a result set.
        ex: INSERT, CREATE, DROP, ALTER TABLE
        :param query_str: Hive query to be executed
        :param database: Hive schema
        :return: "SUCCESS" if the query ran without raising an error
        """
        conn = hive.Connection(host=HIVE_URL, port=HIVE_PORT, database=database, username=HIVE_USER)
        cur = conn.cursor()
        # Make sure to set the staging directory to HDFS to avoid some potential S3-related errors
        cur.execute("SET hive.exec.stagingdir=/tmp/hive/")
        cur.execute("SET hive.exec.scratchdir=/tmp/hive/")
        try:
            cur.execute(query_str)
            return "SUCCESS"
        finally:
            conn.close()
I'm keeping the queries as separate strings so that they can be formatted with external parameters when necessary. The Hive configurations (HIVE_URL, HIVE_PORT, HIVE_USER, HIVE_SCHEMA) are kept as constants. The function "select_query" will be used to retrieve data, and "execute_query" will be used for every other kind of query.
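These constants can live in a small configuration module; the following is a minimal sketch, and the host, port, user and schema values are placeholders rather than the ones from my environment.
# Hive connection settings used by the HiveConnection class.
# Replace the placeholder values with your own cluster details.
HIVE_URL = "127.0.0.1"     # HiveServer2 host
HIVE_PORT = 10000          # default HiveServer2 Thrift port
HIVE_USER = "hive_user"    # user name for the connection
HIVE_SCHEMA = "userdb"     # default schema used by the helper functions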
Hive provides an interactive shell to create databases and tables and to manipulate the data in those tables. We can enter the Hive command line by typing the command "hive", and all the queries given in this article can be executed in that shell as well.
A schema is a collection of tables, similar to a database. Both the keywords SCHEMA and DATABASE are allowed in Hive and we can pick either; here I use SCHEMA instead of DATABASE. A schema can be created with "CREATE SCHEMA", and the keyword "USE" switches into it.
CREATE SCHEMA userdb;
USE userdb;
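The same statements can be run from Python with the class above. Here is a minimal sketch; note that each call opens its own connection, so the schema is selected through the database parameter rather than a separate USE statement.
# Create the schema once; IF NOT EXISTS keeps the call safe to repeat.
HiveConnection.execute_query("CREATE SCHEMA IF NOT EXISTS userdb")

# Every call opens a fresh connection, so pass the schema explicitly.
tables = HiveConnection.select_query("SHOW TABLES", database="userdb")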
There are three types of Hive tables: internal, external and temporary. Internal tables store the table data as well as its metadata inside the database, while external tables store only the metadata inside the database and keep the table data in a remote location such as AWS S3 or HDFS. When an internal table is dropped, all the table data is erased along with the metadata. When an external table is dropped, only the metadata is erased, not the table data. This way the actual data is protected, and if you point a new table to the same location, the data becomes visible through the new table.
Hive is a data warehouse and uses the MapReduce framework, so retrieval speed may not be good enough for small queries. Hive tables can be partitioned to improve performance, and partitioning can be applied to both external and internal tables. Concepts like bucketing are also available, and you can use any of these techniques to enhance performance.
Temporary tables are useful when copying data from one place to another: they act as a temporary location to hold data within a database session, and all temporary tables are cleared once the session ends. Creating a temporary table is not very useful with the "PyHive" library because multiple queries are not supported in a single session; even though we create the table, the same session is no longer available to access it. In the Hive command line, however, you can create a temporary table and then select data from it within a single session.
Internal Tables
The following query creates an internal table with remote data storage on AWS S3. The file format is CSV and the fields are terminated by a comma. "s3_location" points to the S3 directory where the data files reside; it is a user-defined external parameter for the query string and should be supplied at the time of query formatting.
CREATE TABLE `user_info` (
`business_unit` INT,
`employee_id` INT,
`email` VARCHAR(250),
`first_name` VARCHAR(250),
`last_name` VARCHAR(250),
`gender` VARCHAR(250),
`birthday` DATE,
`created_date` DATE,
`updated_date` DATE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LOCATION '{s3_location}'
TBLPROPERTIES (
"s3select.format" = "csv",
"skip.header.line.count" = "1"
);
If the data strings contain commas, they will break the table structure, so I have defined an escape character; every comma that is part of a value needs to be preceded by this escape character before the table is created.
The following is an example record. Note that the email contains a comma.
1,1,ann,smith@gamil.com,Ann,Smith,female,'1992-07-01','2019-09-01','2019-12-31'
The above record needs to be formatted like this:
1,1,ann\\,smith@gamil.com,Ann,Smith,female,'1992-07-01','2019-09-01','2019-12-31'
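If the source data is available as a normally quoted CSV, this escaping can be done in Python before the files are uploaded to S3. Here is a minimal sketch; the file names are hypothetical.
import csv

# Read a normally quoted CSV and write a Hive-friendly copy where commas
# inside values are escaped with a backslash (matching ESCAPED BY '\\').
with open("user_info_raw.csv", newline="") as src, \
        open("user_info_hive.csv", "w", newline="") as dst:
    for row in csv.reader(src):  # csv.reader handles quoted values like "ann,smith@gamil.com"
        escaped = [field.replace(",", "\\,") for field in row]
        dst.write(",".join(escaped) + "\n")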
External Tables
Here, I have partitioned the "user_info" table by "business_unit" and "created_date".
CREATE EXTERNAL TABLE `user_info` (
`employee_id` INT,
`email` VARCHAR(250),
`first_name` VARCHAR(250),
`last_name` VARCHAR(250),
`gender` VARCHAR(250),
`birthday` DATE,
`updated_date` DATE
) PARTITIONED BY (
`business_unit` INT,
`created_date` DATE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
STORED AS
INPUTFORMAT
'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '{s3_location}'
TBLPROPERTIES (
"s3select.format" = "csv",
"s3select.headerInfo" = "ignore"
);
Temporary table
Query for creating a temporary table.
CREATE TEMPORARY TABLE `user_info` (
`business_unit` INT,
`employee_id` VARCHAR(250),
`email` VARCHAR(250),
`first_name` VARCHAR(250),
`last_name` VARCHAR(250),
`gender` VARCHAR(250),
`birthday` DATE,
`created_date` DATE,
`updated_date` DATE
) ;
Drop Table
Query to drop a table. If you are dropping an external table, the data in the remote file storage will not be erased.
DROP TABLE IF EXISTS `user_info`;
Insert data
Once a table is created with external file storage, the data in the remote location is visible through an unpartitioned table right away. This is not true for a partitioned table, which means data cannot be copied into it directly. Instead, we need to create an unpartitioned staging table and insert the data into the partitioned table while providing the partition values. The following query shows how to insert records into such a table.
INSERT INTO TABLE user_static_info PARTITION (business_unit={business_unit}, `created_date`='{execution_date}')
SELECT
Employee_id,
email,
secondary_email,
first_name,
last_name,
orig_gender,
gender,
signup_channel,
signup_from_fb,
birthday,
signup_date,
updated_date,
last_activity_date,
subscription_status
FROM
tmp_user_static_info
WHERE business_id={business_unit}
Since "PyHive" does not support multiple queries in a single session, I had to create the internal table "tmp_user_static_info", which points to the S3 data directory and has no partitions. It was then dropped after inserting the data into the external, partitioned table.
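Putting it together, the load looks roughly like this with the "HiveConnection" class. This is a sketch which assumes the two query strings above are stored in the variables create_tmp_table_query_str and insert_query_str; every call opens its own connection.
# 1. Create the unpartitioned staging table over the raw S3 files.
HiveConnection.execute_query(create_tmp_table_query_str, database="userdb")

# 2. Copy the rows into the partitioned table, supplying the partition values.
HiveConnection.execute_query(
    insert_query_str.format(business_unit=1, execution_date="2019-12-31"),
    database="userdb",
)

# 3. Drop the staging table once the data has been copied.
HiveConnection.execute_query("DROP TABLE IF EXISTS tmp_user_static_info", database="userdb")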
Retrieve data
SELECT queries are used to retrieve data in Hive. They are very similar to SQL SELECT queries and have the following general form; you can build the query to suit your requirements.
SELECT [ALL | DISTINCT] select_expr, select_expr, …
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
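For example, a few rows can be retrieved through the "select_query" function defined earlier, which returns a pandas DataFrame. The column list and LIMIT below are only for illustration.
user_df = HiveConnection.select_query(
    "SELECT employee_id, email, created_date FROM user_info LIMIT 10",
    database="userdb",
)
print(user_df.head())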
Update and Delete data
Hive does not support UPDATE and DELETE statements directly. If you want to change anything in a table, copy the data you need into a new table with SELECT queries, then replace the old table by dropping it and renaming the new table.
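Here is a sketch of that workaround using the "HiveConnection" class; the table name and the filter condition are illustrative, and each statement runs in its own connection, which is fine here.
# Keep only the rows we still want in a new table.
HiveConnection.execute_query(
    "CREATE TABLE user_info_new AS SELECT * FROM user_info WHERE employee_id != 100",
    database="userdb",
)

# Replace the old table with the new one.
HiveConnection.execute_query("DROP TABLE user_info", database="userdb")
HiveConnection.execute_query("ALTER TABLE user_info_new RENAME TO user_info", database="userdb")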
Alter tables
Table alterations are possible in Hive, but they need to be done carefully because the existing data cannot be rewritten. For example, adding a new field in the middle of the table does not shift the data: if we add a new field as the second field, the existing values stay where they are in the files, so each value shows up under the wrong column (data that belonged to the third column appears under the second, and so on) and the last field ends up with no data. This is because of the restriction on updating Hive table data. If we add the new field as the last field instead, it will simply be empty and we can insert data into it.
ALTER TABLE user_static_info ADD COLUMNS (last_sign_in DATE);
If we want to drop the external data as well, we can first convert the table to an internal one and then drop it, as follows.
ALTER TABLE user_static_info SET TBLPROPERTIES('EXTERNAL'='False');
DROP TABLE user_static_info;
Finally, the following code shows how to execute a query using the "execute_query" function of the "HiveConnection" class.
from src.database.hive_con import HiveConnection

create_temp_table_query_str = """CREATE TABLE `user_info` (
`business_unit` INT,
`employee_id` INT,
`email` VARCHAR(250),
`first_name` VARCHAR(250),
`last_name` VARCHAR(250),
`gender` VARCHAR(250),
`birthday` DATE,
`created_date` DATE,
`updated_date` DATE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LOCATION '{s3_location}'
TBLPROPERTIES (
"s3select.format" = "csv",
"skip.header.line.count" = "1"
);""".format(
    s3_location="s3://hive-test/data/user_info/"
)

HiveConnection.execute_query(query_str=create_temp_table_query_str, database="userdb")
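Once the table exists, the data already sitting in the S3 location can be checked with the "select_query" function:
user_info_df = HiveConnection.select_query("SELECT * FROM user_info LIMIT 5", database="userdb")
print(user_info_df.head())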