这是indexloc提供的服务,不要输入任何密码
Skip to content

tdemareuil/tdemareuil.github.io

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

50 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Personal website

This repository holds the files for my personal website, accessible at tdemareuil.github.io.

SQL translator

It also includes a SQL queries translator (Hive ↔︎ Presto ↔︎ Vertica) that I developed as a side project next to my job at Criteo. It's available at: https://tdemareuil.github.io/translate_sql.html.

Goals

  • Faster conversion of SQL queries between Hive ↔︎ Presto ↔︎ Vertica
  • Less hurdle checking the small language details manually, less errors
  • Mainly for data analysis purposes, not prod

Possible improvements

  • Some array functions and concatenation operations still aren’t supported by the script
  • I tried to make it robust to a lot of situations, but sometimes you still might need to check the output of conversion
  • Translating more complex, production-ready queries, and potentially in batches, is a very complicated task and could be an entire project of its own.

Detailed capabilities

Hive

Presto

Vertica

Translation available

Operators & common functions

LATERAL VIEW EXPLODE (original_column) t AS new_column, with small (but very tricky) syntax differences when you explode an array, a map or an array of structs

CROSS JOIN UNNEST(original_column) AS t (new_column), with small (but very tricky) syntax differences when you unnest an array, a map or an array of structs

CROSS JOIN UNNEST(original_column) AS t (new_column), with small (but very tricky) syntax differences when you unnest an array, a map or an array of structs

LATERAL VIEW OUTER EXPLODE (to keep rows with empty array/map)

LEFT JOIN UNNEST (to keep rows with empty array/map)

LEFT JOIN UNNEST (to keep rows with empty array/map)

Supports IF() function (and CASE WHEN)

Supports IF() function (and CASE WHEN)

Doesn't support IF() function → use CASE WHEN

Supports LIKE and RLIKE (regex), not ILIKE (case insensitive)

Supports LIKE only

Supports LIKE and ILIKE (case insensitive), not RLIKE (regex)

Doesn't support IFNULL function → use COALESCE

Doesn't support IFNULL function → use COALESCE

Supports IFNULL function (and COALESCE). Also supports NULLIFZERO and ZEROIFNULL.

Use PMOD(n,m) to return the modulus (remainder) of n divided by m

Use MOD(n,m) to return the modulus (remainder) of n divided by m

Use MOD(n,m) to return the modulus (remainder) of n divided by m

To compute any quantile, use PERCENTILE_APPROX(x, 0.7)

To compute any quantile, use APPROX_PERCENTILE(x, 0.7)

To compute any quantile, use APPROXIMATE_PERCENTILE(x USING PARAMETERS percentile=0.7)

Data types

Dividing 7 by 2 will result in 3.5

Dividing 7 by 2 will result in 3 → to perform floating point division on two integers, cast one of them as a double

Dividing 7 by 2 will result in 3.5

Supports STRING format or VARCHAR(n), with n specified

Doesn't support STRING format but supports VARCHAR(n), with n specified or not

Doesn't support STRING format but supports VARCHAR(n), with n specified or not

Cannot use BOOL instead of BOOLEAN

Cannot use BOOL instead of BOOLEAN

Can use BOOL instead of BOOLEAN

Syntax

Can start an identifier with numbers

Identifiers that start with numbers must be quoted using double quotes: "7day_active"

Identifiers that start with numbers must be quoted using double quotes: "7day_active"

Doesn't support :: notation for type casting

Doesn't support :: notation for type casting

Supports :: notation for type casting

Uses ` for identifiers

Uses " for identifiers

Uses " for identifiers

Cannot use column index in GROUP BY and ORDER BY (or change parameters: SET hive.groupby.orderby.position.alias=true)

Can use column index in GROUP BY and ORDER BY

Can use column index in GROUP BY and ORDER BY

Array operations

Arrays are indexed starting from 0

Arrays are indexed starting from 1

Arrays are indexed starting from 0

Aggregate values as an array: use collect_list(). Add the DISTINCT keyword to deduplicate or use collect_set() to do both at once.

Aggregate values as an array: use array_agg(). Add the DISTINCT keyword or array_distinct() to deduplicate (note that the DISTINCT keyword cannot be used in a window function in Presto).

Use listagg() to aggregate values as comma-separated strings, not as a proper array (could use STRING_TO_ARRAY('['||col||']', ',') to do the trick, and could add USING PARAMETERS max_length=1000000 to avoid length errors). Add the DISTINCT keyword to deduplicate.

Standalone array deduplication: use collect_set(flatname) from (select id, flatname from test lateral view explode(name) t as flatname) g group by id

Standalone array deduplication: use array_distinct()

Seems like we cannot deduplicate an array without cross-joining.

🚫

Forced to use lateral view explode to run array functions (regular min, max, sum, avg)

All the main array functions (min, max, sum, avg, etc.) start with array_{function}(). Note that for the average it’s array_average() (not avg)

All the main array functions (min, max, sum, avg, etc.) start with array_{function}().

🚫

No direct equivalent

Complex array functions such as map, transform, map_from_entries

No direct equivalent

🚫

size() returns the length of an array

cardinality() returns the length of an array

array_length() returns the length of an array

Use ARRAY(content) to create an array

Use ARRAY[content] to create an array

Should be able to use ARRAY[content] to create an array (?)

Concatenate strings with concat_ws('sep', string1, string2, etc.) or concat_ws('sep', array(strings)) (input must be strings or array of strings, contrary to Presto and Vertica)

Concatenate strings with array_join(array, 'sep') or with the || operator

Concatenate strings with concat(val1, val2) (2 elements max) or with the || operator

🚫

NAMED_STRUCT('A', col_A, 'B', col_B) (realiasing inside the struct is not possible when doing Hive → Presto / Vertica)

ROW(col_A, col_B) or CAST(ROW(col_A, col_B) as ROW(A int, B int))(realiasing of the struct is possible when doing Presto / Vertica -> Hive)

ROW(col_A, col_B) or CAST(ROW(col_A, col_B) as ROW(A int, B int))(realiasing of the struct is possible when doing Presto / Vertica -> Hive)

map_from_arrays(key, collect_list(value)) or map_from_arrays(key, collect_set(value))

MAP_AGG(key, value)

MAPAGGREGATE(key, value)

array_contains(array, x)

contains(array, x)

array_contains(array, x)

Date & Time

Use from_unixtime() and unix_timestamp() functions to convert timestamp ↔︎ unixtime

Use from_unixtime() and to_unixtime() functions to convert timestamp ↔︎ unixtime

Use to_timestamp() and EXTRACT(EPOCH from date) to convert timestamp ↔︎ unixtime

select current_date - INTERVAL '10' day returns a timestamp, and doesn't need ''

select current_date - INTERVAL '10' day returns a date and needs ''

select current_date - INTERVAL '10' day returns a timestamp and needs ''

Supports to_date() function

Doesn't support to_date() function → use DATE()

Doesn't support to_date() function → use DATE()

You can extract date parts with TRUNC(str, pattern) or EXTRACT(part from str)

You can extract date parts with DATE_TRUNC(part, date) or EXTRACT(part from date)

You can extract date parts with TRUNC(date, pattern) or TIMESTAMP_TRUNC(date, pattern) or DATE_TRUNC(part, date) or DATE_PART(part, date) or EXTRACT(part from date)

DATEDIFF(str, str) only returns a day difference, works with strings, and substracts the other way round vs. Presto and Vertica

DATE_DIFF(unit, date, date)

DATEDIFF(unit, date, date) or TIMESTAMPDIFF(unit, str, str)

DATE_FORMAT(str, str_format) works with a date already formatted as a string + beware of pattern letters differences

DATE_FORMAT(date, str_format) only works with a date + beware of pattern letters differences

TO_CHAR(date, str_format) only works with a date + beware of pattern letters differences

DATE_ADD(str, value), can only add days, supports date string, negative values are possible but DATE_SUB exists as well

DATE_ADD(unit_str, value, date), with possible negative value, only datetime formats

TIMESTAMPADD(unit_str, value, date), with possible negative value, supports date string, returns a timestamp (not date)

User Interface

Usually, to run Python code in the browser, you need to set-up a web app (e.g. Flask) and to host it on a server (e.g. Heroku or PythonAnywhere). However, it's possible to run a short Python script directly in the browser's front end, even more easily, thanks to Pyodide.

Pyodide is a Python distribution for the browser and Node.js based on WebAssembly. It comes with a robust Javascript ⟺ Python foreign function interface so that you can freely mix these two languages in your code. See documentation.

How does Pyodide work in practice?

→ You just have to include the Pyodide <script> at the start of your HTML file. Then you can access functions such as runPython. You can access variables from the Python scope from JS, and vice versa. All the main Python packages are supported, and you can import others manually.

You can access my small tool at: https://tdemareuil.github.io/translate_sql.html.

Of course, the UI is very simple (pure HTML), and it's not secure at all (you can actually see if the full Python code if you just inspect the HTML..!) — but it's a first step!

About

My personal website.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published