On-chain market data


This artlicle intends to explain the data structure format of the on-chain raw data output generated by the open source Ethereum blockchain ETL tool called Ethereum ETL. We, in Credmark, internally call this raw blockchain data as ‘’First Order Data’’. This document also gives a summary of the on-chain data extraction process. For interaction with smart contracts we have used web3.py python library and Infura API to establish the connection with the ethereum network node.

Additionally, this document also explains the methods to fetch Pricing data of underlying assets across different protocols along with some sample outputs which can help us understand the spread of the price of a single asset across the protocols on the ETH chain.

First Order Data (Raw Ethereum Data)

First Order Data is the raw blockchain data that is fetched and stored by Ethereum ETL software . The data is stored in five categories, namely blocks, contracts, logs, receipts and transactions.

Structure of directories for Ethereum ETL Data

Data is downloaded by EthereumETL in batches on the basis of block number specified by the user while setting up the node categorized.

Example below shows structure of output data received by running Ethereum ETL for block range specified for each category of the raw data.

Let’s go over each data category(as shown in folder structure above) one by one:

  1. Blocks

Sample file for blocks data obtained from EthereumETL available here: Blocks

The details of columns in the blocks.csv are as follows :

  • number - the block number. null when its pending block

  • hash - hash of the block. null when its pending block.

  • parent_hash - hash of the parent block.

  • nonce - hash of the generated proof-of-work. null when its pending block.

  • sha3_uncles - SHA3 of the uncles data in the block.

  • logs_bloom - the bloom filter for the logs of the block. null when its pending block

  • transactions_root - the root of the transaction trie of the block.

  • state_root - the root of the final state trie of the block.

  • receipts_root - the root of the receipts trie of the block.

  • miner - the address of the beneficiary to whom the mining rewards were given.

  • difficulty - integer of the difficulty for this block.

  • total_difficulty - integer of the total difficulty of the chain until this block.

  • size - integer the size of this block in bytes.

  • extra_data - the "extra data" field of this block.

  • gas_limit - the maximum gas allowed in this block

  • gas_used - the total used gas by all transactions in this block.

  • timestamp - the unix timestamp for when the block was collated.

  • transaction_count - the number of transactions included in this block

  • base_fee_per_gas - the minimum price per unit of gas for inclusion in this block, calculated by the network based on demand for block space.

2. Contracts

Sample file for contracts data obtained from EthereumETL available here: Contracts

The details of columns in the contracts.csv are as follows:

  • address - the address of the smart contract. null when its pending block

  • bytecode - the code stored on-chain that describes a smart contract.

  • function_sighashes - all function signature hashes of the contract

  • is_erc20 - boolean variable describing whether the contract follows erc_20

  • is_erc721 - boolean variable describing whether the contract follows erc_721

  • block_number - the block number in which the contract was pushed on network

3. Logs

Sample file for logs data obtained from EthereumETL is available here: Logs

The details of columns in the logs.csv are as follows:

  • log_index - Integer of the event index position in the block.

  • transaction_hash - Hash of the transaction this event was created in.

  • transaction_index - Integer of the transaction’s index position, the event was created in.

  • block_hash - Hash of the block where this event was created in. null if still pending.

  • block_number - The block number where this log was created in. null if still pending.

  • address - From which this event originated from.

  • data - The data containing non-indexed log parameters.

  • topics - An array with max 4 32 Byte topics, topic 1-3 contains indexed parameters of the log. topic 0 is the event id.

4. Receipts

Sample file for receipts data obtained from EthereumETL is available here: Receipts

The details of columns in the receipts.csv are as follows:

  • transaction_hash - Hash of the transaction.

  • transaction_index - Integer of the transactions index position in the block.

  • block_hash - Hash of the block where this transaction was in.

  • block_number - Block number where this transaction was in.

  • cumulative_gas_used - The total amount of gas used when this transaction was executed in the block.

  • gas_used - The amount of gas used by this specific transaction alone.

  • contract_address - The contract address created, if the transaction was a contract creation, otherwise null.

  • root - the root of the final state trie of the transaction.

  • status - TRUE if the transaction was successful, FALSE if the EVM reverted the transaction.

  • effective_gas_price - Price of the gas denoted in small fractions of ether called gwei.

5. Transactions

Sample file for transactions data obtained from EthereumETL is available here: Transactions

The details of columns in the transactions’ csv are as follows-

  • hash - Hash of the transaction.

  • nonce - The number of transactions made by the sender prior to this one.

  • block_hash - Hash of the block where this transaction was in. null if pending.

  • block_number - Block number where this transaction was in. null if pending.

  • transaction_index - Integer of the transactions index position in the block. null if pending.

  • from_address - Address of the sender.

  • to_address - Address of the receiver. null if it’s a contract creation transaction.

  • value - Value transferred in wei.

  • gas - Gas provided by the sender.

  • gas_price - Gas price provided by the sender in wei.

  • input - The data sent along with the transaction.

  • block_timestamp - the unix timestamp for the block this transaction was in.

  • max_fee_per_gas - the maximum amount of gas willing to be paid for the transaction (inclusive of baseFeePerGas and maxPriorityFeePerGas)

  • max_priority_fee_per_gas - the maximum amount of gas to be included as a tip to the miner.

  • transaction_type - a number between 0 and 0x7f, for a total of 128 possible transaction types.

Pricing Data

There are two types of exchanges on which the pricing data can be extracted -

  1. Centralized Exchanges (CEX)

  2. Decentralized Exchanges (DEX)

While there are major differences between these two types of exchanges, we will only discuss the difference of the pricing data in these.

  1. Centralized Exchanges

Centralized Exchanges work in a similar fashion as traditional stock exchanges do and report the prices in the native FIAT currencies. For example, a CEX based in the US will show the price of an asset in US Dollar and a CEX based in India will show the price in INR. These sources of information are out of scope for the purposes of the current article.

2. Decentralized Exchanges

Decentralized Exchanges mostly work on "Constant Product Principal" and report the price of an asset in terms of another asset.

For example, WBTC-WETH pool on decentralized exchange, Uniswap will report the price of WBTC in terms of WETH and the price of WETH will be reported in WBTC.

To access the pricing data from both the sources, generally one of these two methods are followed

  1. API call

  2. Oracle call

API Call

To explain API Call method in this article, we will be using Coingecko’s API

First, we will fetch all the data of top 5000 assets (by market capitalisation) tracked by Coingecko using:

def fetch_market_data(vs_currency='usd', order='market_cap_desc', per_page='100', page='1'):
    COINS_MARKETS_URL = 'https://api.coingecko.com/api/v3/coins/markets?vs_currency={0}&order={1}&per_page={2}&page={3}'.format(vs_currency, order, per_page, page)
    resJs = requests.get(COINS_MARKETS_URL).json()
    # type(resJs)
    return resJs

This function, “fetch_market_data” will require two optional parameters vs_currency, order, per_page and page.

We will be using the US Dollar as default currency and the descending order by market capitalization as the default order to get pricing data.

These parameters will be passed in the Coingecko’s API link and the required response will be returned in JSON format.

However, maximum 100 results can be return in one page, so we have to loop through the 51 pages to get data of top 5000 coins, ordered by market cap in descending order.

# Looping through the first 5000 coins from Coingecko by market ranks and saving them in list, market_data.
market_data = list()
for i in range(51):
    data = fetch_market_data(page=i)
    i += 1

The market_data is a list of top 5000 assets with information tracked by Coingecko from current price and market cap to circulating supply and all time high and low prices.

Below is the example of returned response for Bitcoin (BTC):

Coingecko Bitcoin Example

Pricing data can also be fetched from using contract address in a specific API of Coingecko using the following code:

# To fetch the details of smart contract from Coingecko and return the current price
def fetch_price_from_contract_address(_chain, _contractAddress):
    URL = 'https://api.coingecko.com/api/v3/coins/{0}/contract/{1}'.format(_chain, _contractAddress)
    # print(URL)
    resJS = requests.get(URL).json()
    # return resJS
    usdPrice = resJS['market_data']['current_price']['usd']
    return usdPrice

Oracle Call

Oracle call is an on-chain data and information fetching method, mostly used by smart contracts to keep track of events happening outside the walls of blockchain.

This is an expensive approach when compared to API call approach as it requires the request of creation of smart contract for specific purpose of the user’s smart contract and rewarding validators and data providers to correct information.

Prices can be fetched from different protocols for the asset prices. Most protocols have different methodologies to extract the data. As an example, we have included the sample file screenshots from extraction of ETH-USDC pair asset prices in the appendix below to understand the asset price data structure output across different protocols on-chain. Appropriate code sample is provided for some of the sources as an example in the Appendix section.


ETH-USDC Pricing across different protocols.

  1. Curve Finance

The below code snippet will fetch the price of ether from Curve Finance’s API and store it along with the timestamp into the dataframe, df.

As only current price and timestamp is returned by this API, therefore this API call is looped after every 60 seconds and data is appended to dataframe.

ind = 0
df = pd.DataFrame(columns=['price', 'time'])
for i in range(10000):
    res = requests.get("https://api.curve.fi/api/getETHprice").json()
    if res['success']==True:
        df.loc[ind] = [res['data']['price']] + [res['data']['generatedTimeMs']]
        ind += 1
    if i % 100 == 0:

1.a Data Sample Screenshot

Curve Finance Data Sample

2. Sushiswap

2.a Data Sample Screenshot

Sushiswap Data Sample

3. Uniswap V3

The following code snippet will fetch the current tick and tick spacing from Uniswap V3 ETH-USD contract instance and calculate MIN_TICK, MAX_TICK, MIN_PRICE, MAX_PRICE using mathematical formulas also mentioned in Uniswap V3’s whitepaper.

address = "0x88e6a0c2ddd26feeb64f039a2c41296fcb3f5640"
abi = etherscan_abi_request(address, ETHERSCAN_API_KEY )
contract_instance       = w3.eth.contract(address=address, abi=abi)


CURRENT_TICK = contract_instance.functions.slot0().call()[1]
TICK_SPACING = contract_instance.functions.tickSpacing().call()

MIN_PRICE = 1 / (pow(1.0001, MAX_TICK) * pow(10,(USD_DECIMALS - ETH_DECIMALS )))
MAX_PRICE = 1 / (pow(1.0001, MIN_TICK) * pow(10,(USD_DECIMALS - ETH_DECIMALS )))

3.a Data Sample Screenshot

Uniswap V3 Data Sample

4. Aave V2

The following code snippet will fetch the price of Ether in US Dollar from Aave’s Native API and is looped back to past 100 days from the START_DATE.

df  =  pd.DataFrame(columns = ['Date', 'Prices'])
START_DATE    = datetime.strptime('12-12-2021', '%m-%d-%Y')
ind = 0
for i in range(100):
    res = requests.get('https://aave-api-v2.aave.com/data/liquidity/v2?poolId=0xb53c1a33016b2dc2ff3653530bff1848a515c8c5&date='+str(CURRENT_DATE.date())).json()
    EVERYDAY_DATA[str(CURRENT_DATE.date())] = res
    df.loc[ind] = [str(CURRENT_DATE)] + [EVERYDAY_DATA[str(CURRENT_DATE.date())][0]['referenceItem']['priceInUsd']]
    CURRENT_DATE = CURRENT_DATE - timedelta(1)

4.a Data Sample Screenshot

5. Compound V2

5.a Data Sample Screenshot

6. Balancer

6a Data Sample Screenshot

7. Chainlink

The following code snippet will fetch the Ether price in US Dollar using Chainlink oracle. First a contract instance is established for the contract address of ETH-USD oracle and web3.py library.

After that the latest round number is fetched using latestRoundData and looped in reverse order to get pricing information using getRoundData until a predefined stop date.

chainLinkOracleDic = {"eth-usd": "0x5f4ec3df9cbd43714fe2740f5e3616155c5b8419", 
                        "link-usd": "0x2c1d072e956affc0d435cb7ac38ef18d24d9127c",
                        "usdt-usd": "0x3e7d1eab13ad0104d2750b8863b489d65364e32d",
                        "dai-usd": "0xaed0c38402a5d19df6e4c03f4e2dced6e29c1ee9",
                        "dai-eth": "0x773616e4d11a78f511299002da57a0a94577f1f4"

address = chainLinkOracleDic["eth-usd"]
test_res_v2 = etherscan_abi_request(address, ETHERSCAN_API_KEY )
address_CONTRACT_INSTANCE = w3.eth.contract(address = address, abi=test_res_v2)
#roundID contains latestRound (pos 0) and latestAnswer (pos 1) and updatedAt (pos 3)

latestRoundData = address_CONTRACT_INSTANCE.functions.latestRoundData().call()

latestRound = latestRoundData[0]
latestAnswer = latestRoundData[1]
latestTime = latestRoundData[3]

decimals = address_CONTRACT_INSTANCE.functions.decimals().call()

#collect data from latestRound until stopDate

df = pd.DataFrame(columns=['round', 'prices', 'updated at'])

ind = 0

stopDate = datetime.datetime(2021,10,1)

for i in reversed(range(latestRound)):
    roundData = address_CONTRACT_INSTANCE.functions.getRoundData(i).call()
    dateTime = datetime.datetime.fromtimestamp(roundData[3])
    answer = roundData[1] 
    if dateTime >= stopDate:
        df.loc[ind] = [i] + [np.divide(answer, 10**decimals)] + [dateTime]
        ind += 1       

7.a Data Sample Screenshot


  1. Trie, also called Radix Trie, Patricia Trie, or Prefix Tree, is a data structure which is fastest at finding common prefixes, simple to implement, and requires small memory. As Ethereum uses Merkle Tree to efficiently store the hash in blocks, Trie is used as a core data structure of data storage.

  2. Constant product formula: x * y = k, where x is the amount of Token#1 in the liquidity pool, y is the amount of Token#2 in the liquidity pool, and k is a fixed constant.

    Price of Token#1 will be calculated in terms of Token#2 according to their respective quantity in the pool while maintaining the constant product of both tokens’ quantity.


Discord HandleETH AddressRewardContribution



0 CMK (internal)

Original version

Last updated