Populating Cloud Data Warehouses with Apache NiFi


Problem:


Constantin, a Master Data Engineer pointed out that JSON, Parquet or Avro are the most used semi-structured data types, but some Enterprise Applications still produce XML!  Parsing XML is expensive,  Ingesting XML is challenging and Large XML files are hard to chunk for parallel ingest.  We still need this data and don't want to write custom code to parse it.   So let's break it down and ingest this data at scale.

Source:   https://github.com/tspannhw/clouddatawarehouse


Ingest Raw Data:



<?xml version='1.0' encoding='utf-8'?><rss xmlns:dc='http://purl.org/dc/elements/1.1/' version='2.0'><channel><title>travel.state.gov: Travel Advisories</title><link>travel.state.gov: Travel Advisories</link><description>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories.html</description><item><title>Bhutan - Level 1: Exercise Normal Precautions</title><pubDate>Fri, 25 Jan 2019</pubDate><link>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories/bhutan-travel-advisory.html</link><guid>http://travel.state.gov/content/travel/en/traveladvisories/traveladvisories/bhutan-travel-advisory.html</guid><dc:identifier>BT</dc:identifier><description><![CDATA[<p>Exercise normal precautions in Bhutan.&nbsp;</p>]]></description></item>



Apache NiFi Flow to Read US Travel Advisors XML




Let's Ingest Those Advisors For Fast Queries and ML Analysis








Any Data, Anywhere, Anytime


With Apache NiFi you can populate Cloud Data Warehouses with ease directly through JDBC or to S3 buckets from sources including email, REST, Logs, Databases, NoSQL stores, Hadoop, XML, PDF, Word, AVRO, JSON, CSV, Text, Syslog and more.   For this example I am grabbing XML/RSS feeds from public sources on travel advisors around the world for US citizens.   It is easy to route, cleanse, transform and store at scale in our hyperscale Cloud Data Warehouse regardless of cloud.

Source URL:   https://travel.state.gov/_res/rss/TAsTWs.xml



Table Design

Let's create some tables. You should design an ER schema and use either 3NF, Snowflake Schema, Star Schema or Data Vault. Here are two simple tables to contain all the fields we have. We may want to add a primary key and date / timestamp to our travelwarning table as well.


CREATE TABLE TRAVELWARNING ( title VARCHAR,  pubDate VARCHAR,  link VARCHAR,  guid VARCHAR,  identifier VARCHAR,  description VARCHAR );

CREATE TABLE TRAVELWARNING_RAW( TW_WARN_ID NUMBER(38,0) identity primary key, TW_WARNING VARIANT, TW_CREATE_DT DATE DEFAULT CURRENT_DATE );



Connecting to A Cloud Data Warehouse
Image
We just need a JDBC Driver, Driver class name, URL, username, password and download JAR.   Now we can connect for queries or inserts with ease at full speed.   Make sure you download the latest version of the JDBC driver you need and store it somewhere that the Apache NiFi user can use it.  I recommend a special directory for these drivers.  
We can easily use record processing to ingest 1,000s of records and convert them to useful types like JSON or AVRO before inserting them into our data warehouse.
Let's insert rows into our Cloud Data Warehouse on AWS as raw JSON into a VARIANT column using PutSQL.
Image

We can also use Records to easily infer the schema and push to our our regularly defined table using PutDatabaseRecord.

We can now query the data as it is arriving.
Image

Other Ingest Options

We can also land the JSON data in S3 and have that ingested via Snowpipe or bulk load with COPY

Examples of Other Data We Can Ingest




Useful Cloud Data Warehouse Resources







Comments

  1. I appreciate the time and effort you put into this article about Azure Data Migration Services. This information is both useful and lavish. I'd like to express my gratitude to everyone who contributed to this article.

    ReplyDelete
  2. The article you've shared here is fantastic because it provides some excellent information that will be incredibly beneficial to me. Thank you for sharing that. Keep up the good work. warehouse Japan

    ReplyDelete
  3. I always prefer to read the quality content and this thing I found in you post. I am really thank full for you for this post. we also provide service for warehouse Japan. for more info visit our website.

    ReplyDelete
  4. I appreciate the work and effort that went into this article. This paragraph provides you with a clear picture. Without a question, it is an insightful piece that may assist us in expanding our understanding.Vietnam Import Data

    ReplyDelete
  5. Play Slots Online with CHLOE Casino
    Explore Play Slots Online with งานออนไลน์ CHLOE Casino ☆ Sign up now for instant play ✚ deccasino Best online slots casino bonuses ⏩Play choegocasino the most popular

    ReplyDelete
  6. Where to Bet on Sports To Bet On Sports In Illinois
    The best sports bet types and https://deccasino.com/review/merit-casino/ bonuses available in worrione Illinois. The most common sports betting options 토토 사이트 코드 available. Bet $20, 출장샵 Win $150, Win $100 or wooricasinos.info

    ReplyDelete

Post a Comment