This project is mirrored from https://gitee.com/cowcomic/pixie.git.
- 12 Sep, 2020 2 commits
-
-
Michelle Nguyen authored
Summary: Updated the graph configs to speed things up, e.g. the number of stabilization iterations, smooth edges, and improved layout (which actually console.logs that you should disable it for large graphs). Clustered graphs looked like they were having a problem since we were setting the cluster ID when it was already set to what we wanted. There is still some slowness, but at least I haven't had anything hang.
Test Plan: Tried it with a customer's non-prod clusters, which have pretty big graphs.
Reviewers: zasgar, nserrino, #engineering
Reviewed By: zasgar, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6215
GitOrigin-RevId: c6e57465357bc3207b78bc9d3dab2a19f0511f34
-
Michelle Nguyen authored
Summary: We were seeing issues where GQL requests were getting 502s while a gRPC request was being made. This is because we can't serve both requests on the same stream in nginx.
Test Plan: Tested on staging with something polling GQL and something else polling a gRPC request.
Reviewers: zasgar, #engineering
Reviewed By: zasgar, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6214
GitOrigin-RevId: 9a253e182d3a1830dd815ba48c47b6eccc5ffc3e
-
- 11 Sep, 2020 1 commit
-
-
Yaxiong Zhao authored
Summary: Previously, the function returned an error status. This avoids a confusing error when deploying a probe that does not actually probe arguments with unsupported types. GetFunctionArgInfo() is called to return the ArgInfo of all of the arguments of a function, regardless of whether the argument is probed or not. This fix tries to minimize the scope of changes. A possible alternative is to only resolve ArgInfo for each arg expression, which appears more intrusive.
Test Plan: Jenkins
Reviewers: oazizi, #engineering
Reviewed By: oazizi, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6203
GitOrigin-RevId: 6f18e52e97d0be957b1c038e0229ca4557f9b1f6
-
- 09 Sep, 2020 1 commit
-
-
Omid Azizi authored
Summary: For readability Test Plan: Existing tests Reviewers: yzhao, #engineering Reviewed By: yzhao, #engineering Differential Revision: https://phab.corp.pixielabs.ai/D6188 GitOrigin-RevId: bc60521e0983ea6debac62555240e035b98a42d5
-
- 11 Sep, 2020 4 commits
-
-
Omid Azizi authored
Summary: Restoring the BPFTrace submodule and build. Test Plan: Manual Reviewers: yzhao, #engineering Reviewed By: yzhao, #engineering Differential Revision: https://phab.corp.pixielabs.ai/D6211 GitOrigin-RevId: c4fab9013bda7dd6ea92070da2cb864f430510bd
-
Phillip Kuznetsov authored
Summary: px.equals_any replaces a long chain of OR'd equality comparisons with a single call:
```
df = df[px.equals_any(df.remote_addr, ['10.0.0.1', '10.0.0.2', '10.0.0.3'])]
```
Test Plan: Added a compiler test, since the feature requires multiple steps through the compiler.
Reviewers: nserrino, #engineering, zasgar
Reviewed By: #engineering, zasgar
Differential Revision: https://phab.corp.pixielabs.ai/D6209
GitOrigin-RevId: ff83c3485393751f7c1a62732ce3b2e6edc2298d
-
Michelle Nguyen authored
Summary: We're running into a bug where our GetAgentUpdates gRPC streams are never terminated if an HTTP/2 timeout is hit. We don't want all of these zombie streams to keep running in the metadata service, since they can build up, so we enforce that only one is running at a time (since, currently, only one needs to run at a time).
Test Plan: Ran skaffold with an HTTP/2 timeout of 10s to ensure that old streams are terminated.
Reviewers: nserrino, zasgar, #engineering
Reviewed By: nserrino, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6207
GitOrigin-RevId: 11008d1eec10e8ba74bd43791572dd1c2f1ec0e8
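The enforcement described here can be sketched in Go: the server remembers a cancel function for the currently active stream and cancels it before admitting a new one. The `Server` type, its field names, and the `registerStream` helper below are illustrative assumptions, not the actual metadata service code.

```go
package agentstream

import (
	"context"
	"sync"
)

// Server tracks at most one active GetAgentUpdates stream at a time.
// Names here are invented for illustration.
type Server struct {
	mu         sync.Mutex
	cancelPrev context.CancelFunc // cancels the currently running stream, if any
}

// registerStream cancels any previously registered stream and returns a
// context that the new stream handler should use for its send loop.
func (s *Server) registerStream(parent context.Context) context.Context {
	s.mu.Lock()
	defer s.mu.Unlock()

	if s.cancelPrev != nil {
		// Terminate the zombie stream so it stops consuming updates.
		s.cancelPrev()
	}

	ctx, cancel := context.WithCancel(parent)
	s.cancelPrev = cancel
	return ctx
}
```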
-
Michelle Nguyen authored
Summary: This doesn't happen during a normal deploy/update process, only when running on skaffold and two metadata pods exist at the same time, with the older one as leader. Here's what could happen:
- Old metadata is running.
- New metadata is running and initializes its agent queues by doing a GetActiveAgents(). This hits etcd since there's nothing in the cache yet.
- Old metadata, still not deleted yet, registers the new Kelvin starting up and writes it to etcd.
- New metadata finishes initializing. It does not know anything about the Kelvin, since it called GetActiveAgents before the Kelvin was written.
- Old metadata dies and stops responding to the Kelvin. The Kelvin dies.
- A new Kelvin starts up. Since the new metadata doesn't know about the old Kelvin, it never gets cleared up.
Test Plan: Ran skaffold a bunch of times; verified there were no zombie agents in the agent_status query.
Reviewers: nserrino, zasgar, #engineering
Reviewed By: nserrino, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6205
GitOrigin-RevId: 1bc5669d86cdc078362d4d0411582cffcddec650
-
- 10 Sep, 2020 2 commits
-
-
Michelle Nguyen authored
Summary: Cleaned up the agentHandler stop logic to use WaitGroups rather than the quitDone channel. Also updated the logic to ensure that we can't try writing to the quitCh when it is already closed. The following is something that could happen:
- Stop() is called because a new agent with hostname X is trying to register and an old agent with hostname X is already registered.
- Stop() closes the quitCh.
- The agent dies before it is registered, for whatever reason.
- A new agent starts up, and Stop() is called again for the agent with hostname X.
- Stop() tries to write to the quitCh.
- Panic.
Test Plan: Ran skaffold and deleted PEMs to confirm that things correctly get deleted/registered.
Reviewers: zasgar, #engineering, nserrino
Reviewed By: #engineering, nserrino
Differential Revision: https://phab.corp.pixielabs.ai/D6199
GitOrigin-RevId: 05abab4144f55c521900bc5df83f0ae064181542
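A minimal Go sketch of the pattern this commit describes: quit is signaled by closing quitCh exactly once (guarded by sync.Once), and a WaitGroup replaces the quitDone channel. The type and method names are hypothetical stand-ins for the real agentHandler, not its actual implementation.

```go
package agenthandler

import "sync"

// AgentHandler owns the goroutines for a single agent. Illustrative only.
type AgentHandler struct {
	msgCh    chan []byte
	quitCh   chan struct{}
	stopOnce sync.Once      // ensures quitCh is closed at most once
	wg       sync.WaitGroup // replaces the old quitDone channel
}

func New() *AgentHandler {
	return &AgentHandler{
		msgCh:  make(chan []byte, 256),
		quitCh: make(chan struct{}),
	}
}

// Start launches the per-agent message loop.
func (h *AgentHandler) Start() {
	h.wg.Add(1)
	go func() {
		defer h.wg.Done()
		for {
			select {
			case <-h.quitCh:
				return
			case msg := <-h.msgCh:
				_ = msg // ... handle the agent message ...
			}
		}
	}()
}

// Stop is safe to call more than once: quit is signaled by closing the
// channel (never by sending on it), and sync.Once prevents a double close,
// so the "send on closed channel" panic described above can't happen.
func (h *AgentHandler) Stop() {
	h.stopOnce.Do(func() { close(h.quitCh) })
	h.wg.Wait()
}
```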
-
Natalie Serrino authored
Summary: Was debugging an issue where the control plane pod status createdAtMs was returning a large negative number. This didn't turn out to be the issue, but I wrote a test for the scanner of the control plane pod statuses class to see if it was causing the problem by mangling the data.
Test Plan: The test.
Reviewers: michelle, zasgar, #engineering
Reviewed By: michelle, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6194
GitOrigin-RevId: 01698045cb6476153465a12038091fa61e5de094
-
- 04 Sep, 2020 1 commit
-
-
Zain Asgar authored
Summary: Added caching to help the performance of nslookup. We also need to batch/async these to improve performance, which will be the next optimization.
Test Plan: N/A, we don't have UDTF tests yet.
Reviewers: michelle, #engineering
Reviewed By: michelle, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6204
GitOrigin-RevId: 586acab78c378bf70cc10a2162d1d755c268c032
-
- 11 Sep, 2020 2 commits
-
-
Phillip Kuznetsov authored
Summary: TSIA Test Plan: Tested on dev-cluster-philkuz, simple change shouldn't break things. Reviewers: zasgar, oazizi, #engineering Reviewed By: oazizi, #engineering Differential Revision: https://phab.corp.pixielabs.ai/D6206 GitOrigin-RevId: 233c25927ded78e0d641d656bd05f1abc757d03c
-
Natalie Serrino authored
Summary: We have a problem between the query broker and the metadata service across the GetAgentUpdates API. This API was originally designed for a single consumer (the singleton query broker). We are running into an issue where the stream between them times out from the query broker's perspective. As a result, the query broker decides to reconnect. However, despite the gRPC error on the query broker side, from the metadata service's perspective the first stream stays alive for some reason, and the second stream gets connected too. GetAgentUpdates was only designed to support a single consumer, so both streams update the state, leading to inconsistent results on the non-zombie stream that has reconnected from the query broker.
@michelle is looking into the zombie streams issue, but for now this change allows the metadata service's GetAgentUpdates to support multiple consumers. This is a step that is necessary for us anyway once we want to support multiple metadata services and query brokers, and should put us a bit closer to the eventual design with versioned updates.
Test Plan: Edited existing tests.
Reviewers: michelle, zasgar, #engineering
Reviewed By: michelle, #engineering
Subscribers: michelle
Differential Revision: https://phab.corp.pixielabs.ai/D6202
GitOrigin-RevId: 0830feef6adddbe7e105ba34446e3fd04ca3040c
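One way to picture the multi-consumer version of GetAgentUpdates is a small fan-out broker where each consumer subscribes to its own channel. This Go sketch is an assumption about the shape of the change, not the metadata service's actual code; AgentUpdate, Broker, and the channel sizes are invented for illustration.

```go
package agentupdates

import "sync"

// AgentUpdate stands in for the real update message.
type AgentUpdate struct {
	AgentID string
	Payload []byte
}

// Broker fans agent updates out to any number of GetAgentUpdates consumers,
// instead of assuming a single query-broker stream.
type Broker struct {
	mu        sync.Mutex
	consumers map[int]chan AgentUpdate
	nextID    int
}

func NewBroker() *Broker {
	return &Broker{consumers: make(map[int]chan AgentUpdate)}
}

// Subscribe registers a new consumer and returns its channel plus an
// unsubscribe function to call when the stream ends.
func (b *Broker) Subscribe() (<-chan AgentUpdate, func()) {
	b.mu.Lock()
	defer b.mu.Unlock()
	id := b.nextID
	b.nextID++
	ch := make(chan AgentUpdate, 1024)
	b.consumers[id] = ch
	return ch, func() {
		b.mu.Lock()
		defer b.mu.Unlock()
		delete(b.consumers, id)
	}
}

// Publish delivers an update to every registered consumer, dropping it for a
// consumer that has fallen too far behind rather than blocking the others.
func (b *Broker) Publish(u AgentUpdate) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.consumers {
		select {
		case ch <- u:
		default: // consumer is too slow; skip rather than block
		}
	}
}
```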
-
- 10 Sep, 2020 4 commits
-
-
Michelle Nguyen authored
Summary: We need this label so that the pod is picked up in the cloud connector and sent to our UI as a control plane pod.
Test Plan: n/a
Reviewers: zasgar, nserrino, #engineering
Reviewed By: zasgar, nserrino, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6197
GitOrigin-RevId: 74fdad3a6ca80bc3edba7e526db0163b4df931de
-
Michelle Nguyen authored
Summary: We have a race condition between agent deletions and schema updates. Currently, it is possible for the following to occur:
- An agent heartbeats.
- Agent updates are put into a queue to be processed.
- The agent update is processing and sees the agent is alive.
- The agent is deleted on a separate thread.
- The update handler thinks the agent is alive, so it adds a schema for an agent which should no longer exist.
To fix this I did a few things:
- Refactored agent deletion so that it only occurs in a single place (when the agentHandler is quitting). This required some use of channels and blocking so that agent A, with the same hostname as a newly registering agent B, is deleted before we try to register agent B.
- Moved agent schema/process updates out of the singular queue. They are now processed within the onAgentHeartbeat call. Now, if an agent is deleted, it must finish processing the current onAgentHeartbeat and apply the schema/process updates before the agent is actually deleted from the metadata store. Likewise, if a heartbeat comes in after the agent is already deleted, the AgentHandler has already stopped, so no updates will be made for this agent.
Test Plan: Ran skaffold with a bunch of logs to check that things are done in the right order.
Reviewers: zasgar, nserrino, #engineering
Reviewed By: zasgar, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6195
GitOrigin-RevId: 346e26e3877fddea2d8a205be2d3601c70ba50eb
-
Zain Asgar authored
Summary: We can't delete an entry while iterating over the container. This fixes that issue by creating a deletion list.
Test Plan: ASAN fix.
Reviewers: michelle, #engineering
Reviewed By: michelle, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6196
GitOrigin-RevId: 1f65706ebeb3ce7c02ef6306378045bf83c61579
-
Michelle Nguyen authored
Summary: This should give us more control and insight into defrag errors, if any. For now, we start defragging once the etcd instance hits 500MB; after that, we defrag every hour unless it drops below 500MB. We can calibrate this more once we get a better sense of when we should do defrags. We should also not perform any cache flushes while defragging, so that all of the agent data stays in memory.
Test Plan: Shortened the times so that I don't have to actually wait for an hour, then ran on skaffold with the etcd stateful set and the etcd operator.
Reviewers: zasgar, nserrino, #engineering
Reviewed By: zasgar, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6189
GitOrigin-RevId: 35db15ea116605081d2536d9c7e5af5751cb224d
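The defrag policy described here could look roughly like the following Go sketch using the etcd clientv3 maintenance API. The loop structure, logging, and threshold constant are assumptions for illustration; only Status and Defragment are real clientv3 calls.

```go
package etcdmaint

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

const defragThresholdBytes = 500 * 1024 * 1024 // start defragging at ~500MB

// defragLoop checks the etcd DB size every hour and defragments the endpoint
// whenever it is above the threshold. Sketch only, not the actual code.
func defragLoop(ctx context.Context, cli *clientv3.Client, endpoint string) {
	ticker := time.NewTicker(time.Hour)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			status, err := cli.Status(ctx, endpoint)
			if err != nil {
				log.Printf("failed to get etcd status: %v", err)
				continue
			}
			if status.DbSize < defragThresholdBytes {
				continue // below the threshold; nothing to do
			}
			// Defrag can block reads/writes on this member, so the caller
			// should also pause cache flushes while this runs (as noted above).
			if _, err := cli.Defragment(ctx, endpoint); err != nil {
				log.Printf("etcd defrag failed: %v", err)
			}
		}
	}
}
```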
-
- 09 Sep, 2020 1 commit
-
-
Michelle Nguyen authored
Summary: Currently, the metadata index requests block the processing of any agent messages. This should only really happen when a cluster starts up, or when the cloud sees that any metadata is missing. When the metadata service receives a message on NATS, it either calls the agentHandler.HandleMessage function, which puts the message on the correct agent channel, or, if it is a request for missing metadata from the cloud indexer, it calls the MetadataTopicListener.HandleMessage function, which makes the request to etcd and sends out the response. The latter needs to be processed on a separate channel.
Test Plan: Ran skaffold.
Reviewers: zasgar, #engineering
Reviewed By: zasgar, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6183
GitOrigin-RevId: 1ac7e757dd36f4e4647efb3d35e4bc1ba88d1af2
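A rough Go sketch of the routing this commit describes: the NATS callback only dispatches, and agent messages and cloud-indexer requests are drained by separate goroutines so a slow etcd-backed indexer request cannot stall agent handling. The Message type, its kinds, and the buffer sizes are hypothetical.

```go
package msgbus

// Message stands in for a decoded NATS message; Kind and the handler
// shapes below are illustrative, not the real metadata service API.
type Message struct {
	Kind    string // "agent" or "missing_metadata"
	Payload []byte
}

// Dispatcher routes agent messages and cloud-indexer requests onto separate
// channels so they are processed independently.
type Dispatcher struct {
	agentCh   chan Message
	indexerCh chan Message
}

func NewDispatcher() *Dispatcher {
	d := &Dispatcher{
		agentCh:   make(chan Message, 4096),
		indexerCh: make(chan Message, 256),
	}
	go d.processAgentMessages()
	go d.processIndexerRequests()
	return d
}

// HandleNATSMessage is the single NATS callback; it only routes.
func (d *Dispatcher) HandleNATSMessage(m Message) {
	if m.Kind == "missing_metadata" {
		d.indexerCh <- m
		return
	}
	d.agentCh <- m
}

func (d *Dispatcher) processAgentMessages() {
	for m := range d.agentCh {
		_ = m // forward to the per-agent handler
	}
}

func (d *Dispatcher) processIndexerRequests() {
	for m := range d.indexerCh {
		_ = m // query etcd and publish the missing-metadata response
	}
}
```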
-
- 08 Sep, 2020 1 commit
-
-
Michelle Nguyen authored
Summary: the new `make update_bundle` requires px to be in the path and was erroring without it. updated the Jenkinsfile so that it builds the CLI and moves it to /usr/local/bin for the `make update_bundle` command Test Plan: released staging + prod cloud Reviewers: zasgar, #engineering Reviewed By: zasgar, #engineering Differential Revision: https://phab.corp.pixielabs.ai/D6177 GitOrigin-RevId: 3a3109ff57fb48889c50c52f8b53649e2cc61105
-
- 09 Sep, 2020 1 commit
-
-
Yaxiong Zhao authored
Test Plan: Jenkins Reviewers: oazizi, #engineering Reviewed By: oazizi, #engineering Differential Revision: https://phab.corp.pixielabs.ai/D6191 GitOrigin-RevId: 435d9b15b25c4c0fa4d5875145bc98a04f3a5d65
-
- 08 Sep, 2020 2 commits
-
-
Omid Azizi authored
Summary: A fix to the shared object path, which needs /host prefix. Plus a bunch of related fixes to propagate errors up, that were required in analyzing this case. Test Plan: Manually tested on GKE. Count on existing tests for the rest. Reviewers: yzhao, #engineering Reviewed By: yzhao, #engineering Differential Revision: https://phab.corp.pixielabs.ai/D6180 GitOrigin-RevId: bbc303a4e4942a59fdfa1634998aee43354d07af
-
Michelle Nguyen authored
Summary: Some of our customers' clusters were hitting a problem where the number of bytes we were flushing from the cache was too large for a single txn. We currently batch requests by a max number of txns; however, it is possible to have really large requests with fewer than the max number of txns. This updates our etcd batcher util to take the max number of bytes into account as well.
Test Plan: Unit test.
Reviewers: zasgar, #engineering, nserrino
Reviewed By: #engineering, nserrino
Differential Revision: https://phab.corp.pixielabs.ai/D6178
GitOrigin-RevId: 12b4cf9a3ee5ce1b082effe34b930cd5dcc21120
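The byte-aware batching could be sketched like this in Go: a batch is flushed once adding the next put would exceed either the op limit or the byte limit. The limits, the opSize estimate, and the function shape are assumptions; only clientv3.OpPut is a real etcd client call.

```go
package etcdbatch

import "go.etcd.io/etcd/clientv3"

// Limits for a single etcd transaction. The exact values are illustrative.
const (
	maxOpsPerTxn   = 64
	maxBytesPerTxn = 1 << 20 // ~1MB, safely under etcd's request size limit
)

// opSize roughly estimates how much a put contributes to the request size.
func opSize(key, val string) int { return len(key) + len(val) }

// batchPuts splits key/value pairs into batches that respect both the op
// count limit and the byte limit, so one batch can't blow past etcd's max
// request size even when individual values are large.
func batchPuts(keys, vals []string) [][]clientv3.Op {
	var batches [][]clientv3.Op
	var cur []clientv3.Op
	curBytes := 0

	for i := range keys {
		sz := opSize(keys[i], vals[i])
		// Flush the current batch if adding this op would exceed a limit.
		if len(cur) > 0 && (len(cur) >= maxOpsPerTxn || curBytes+sz > maxBytesPerTxn) {
			batches = append(batches, cur)
			cur, curBytes = nil, 0
		}
		cur = append(cur, clientv3.OpPut(keys[i], vals[i]))
		curBytes += sz
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}
```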
-
- 09 Sep, 2020 5 commits
-
-
Yaxiong Zhao authored
Test Plan: Jenkins Reviewers: oazizi, #engineering Reviewed By: oazizi, #engineering Differential Revision: https://phab.corp.pixielabs.ai/D6184 GitOrigin-RevId: 2b5a33f13b1b0de6af3a108355a1e0f6b3a7a466
-
Natalie Serrino authored
add in some extra log messages to provide more context when debugging metadata<->query broker issues. Summary: tsia. wanted to avoid printing actual agent ids because it clogs up the logs on large clusters. Test Plan: n/a Reviewers: michelle, zasgar, #engineering Reviewed By: michelle, #engineering Differential Revision: https://phab.corp.pixielabs.ai/D6190 GitOrigin-RevId: bf9b6e3caea6a9ad61ada544ced5f2337721493c
-
Yaxiong Zhao authored
Test Plan: Manual test with nc:
* Launch stirling_wrapper
* nc -l 50050
* nc localhost 50050
When there is no data sent, no records are exported from stirling_wrapper.
Reviewers: oazizi, #engineering
Reviewed By: oazizi, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6185
GitOrigin-RevId: 5e1664b5b23443c9d0ebae4a005bc88910f3009c
-
Omid Azizi authored
Summary: The compiler now uses AUTO as the language. This enables tracing shared libraries from the UI.
Test Plan: Test added.
Reviewers: yzhao, #engineering
Reviewed By: yzhao, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6175
GitOrigin-RevId: 9f0b357b2477b4ac09d471e661eaed195ca249c9
-
Natalie Serrino authored
Summary: tsia Test Plan: n/a Reviewers: michelle, #engineering, zasgar Reviewed By: michelle, #engineering, zasgar Differential Revision: https://phab.corp.pixielabs.ai/D6186 GitOrigin-RevId: ad05c11bfaf955a929e1ec9e9246013255dedca6
-
- 07 Sep, 2020 1 commit
-
-
Omid Azizi authored
Summary: The program was being compiled once in the test and once by Create(). Avoid this, and get the schema from the connector. Also removed a bunch of member variables where local ones would suffice.
Test Plan: Existing tests.
Reviewers: yzhao, #engineering
Reviewed By: yzhao, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6174
GitOrigin-RevId: 521c2bfcccd6fceb9046c35c4de49208a62a58d6
-
- 05 Sep, 2020 1 commit
-
-
Natalie Serrino authored
PP-2115: Update the GRPCSink node to send a request initializing the result stream before sending any Carnot results.
Summary: Depends on D6167. Carnot upstream result destinations (such as Kelvin or the query broker) need to be able to track which of their inbound streams have initiated a connection, and monitor those connections for health. If any of those downstream result connections becomes unhealthy or takes too long to connect, we will cancel and time out the query. This diff adds the logic to GRPCSinkNode to send those stream initialization messages as soon as Open() is called. That way, the remote destinations don't have to wait for data to be produced by the source node to know the state of the connection, since production of result data may take a long time in a streaming query where the results are sparse.
Next up is switching the way that exec_graph.cc and the query broker do timeouts. They should no longer time out when it takes too long to receive a result. They should time out if an inbound source takes too long to initiate a connection. Then they should monitor the successful connections during query execution to make sure nothing has been disconnected (the query broker already does this part).
Test Plan: Edited unit tests.
Reviewers: michelle, philkuz, zasgar, #engineering
Reviewed By: michelle, #engineering
JIRA Issues: PP-2115
Differential Revision: https://phab.corp.pixielabs.ai/D6169
GitOrigin-RevId: 7a7aa0cc0f39b156e3038322320d31de00000d44
-
- 03 Sep, 2020 1 commit
-
-
Natalie Serrino authored
PP-2115: Change the TransferResultSink API to support result sinks initializing their connection to the downstream destination.
Summary: Streaming queries may spend a long time between sending result batches if the data they are producing is sparse. That means that the timeout-based approach of monitoring query health will not scale to streaming queries, because a timeout is no longer an accurate way of assessing whether a query is healthy. We want to replace this timeout method with checking the health of the gRPC connections from the sources to the destinations for the result data. Kelvin nodes will cancel their queries if the connection between them and their source data agents gets lost. The query broker already does this handling: if one of its agent streams gets disconnected, it will cancel the rest of the query everywhere else. However, the question remained of how a query would be cancelled if its connections from source to destination were never established in the first place.
The solution here is to have all sinks send a message with their identity and the table/result they are producing when establishing a stream to a remote destination. Then the destination can keep track of which streams it has received initialization for, and time out if a sink takes too long to set up its connection to the destination. This diff sets the groundwork for that by changing the TransferResultSink message to support sending an initial message for a stream initializing the result sink. The next diff will have the query broker expecting and using those initialization messages, and removing its timeouts. After that, exec_graph.cc/h will be refactored to no longer time out when data has not come in for a while, but to time out if any of the exec graph's remote sources has taken too long to establish a connection.
Test Plan: Existing/unit tests.
Reviewers: michelle, philkuz, zasgar, #engineering
Reviewed By: michelle, #engineering
JIRA Issues: PP-2115
Differential Revision: https://phab.corp.pixielabs.ai/D6167
GitOrigin-RevId: 8a24e118e58bbdcf07f9b57efff29eeb75d8ed09
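A hedged Go sketch of the protocol shape this diff introduces: the sink's first message on the stream identifies the result table it will produce, and row batches only follow afterwards. The message and type names below are invented placeholders, not the actual TransferResultSink protos.

```go
package resultsink

// ResultRequest sketches a stream message that is either an initiation
// message or a row batch. All names are hypothetical stand-ins.
type ResultRequest struct {
	QueryID string
	// Exactly one of the following is set per message.
	InitiateStream *StreamInit
	RowBatch       *RowBatch
}

type StreamInit struct {
	TableName string // the result table this sink produces
}

type RowBatch struct {
	NumRows int64
	// ... column data elided ...
}

// resultStream is the minimal client-streaming surface the sink needs.
type resultStream interface {
	Send(*ResultRequest) error
}

// openSink sends the initiation message as soon as the sink opens, so the
// destination can start tracking and health-checking this connection even if
// the first row batch takes a long time (e.g. a sparse streaming query).
func openSink(stream resultStream, queryID, table string) error {
	return stream.Send(&ResultRequest{
		QueryID:        queryID,
		InitiateStream: &StreamInit{TableName: table},
	})
}
```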
-
- 08 Sep, 2020 2 commits
-
-
Yaxiong Zhao authored
Summary: Previously, a conn_stats record was kept forever, resulting in ever-expanding memory use and ever-increasing data being exported to the table store.
Test Plan: Manual test with stirling_wrapper:
1. Launch stirling_wrapper
2. Launch nc -l 50050
3. Launch nc localhost 50050 -q 0 (-q 0 allows closing the connection with Ctrl-D)
4. Observed that after closing the nc connection, stirling_wrapper no longer exports records for the closed connection.
Writing a test with TCPSocket, but it is time consuming, so leaving it for a follow-up diff.
Reviewers: oazizi, #engineering
Reviewed By: oazizi, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6181
GitOrigin-RevId: 085fe8681b936c513671560927ee8fe5d3ebf2e5
-
Omid Azizi authored
Summary: TBD Test Plan: Existing tests Reviewers: yzhao, #engineering Reviewed By: yzhao, #engineering Differential Revision: https://phab.corp.pixielabs.ai/D6171 GitOrigin-RevId: 52e3df9cd42d864883e2d4a66f46011a97a8d388
-
- 04 Sep, 2020 1 commit
-
-
Omid Azizi authored
Summary: As requested; has TODOs for when the proto is updated.
Test Plan: Added a test; probably needs more with the proto changes.
Reviewers: nserrino, #engineering, philkuz
Reviewed By: nserrino, #engineering
JIRA Issues: PP-2191
Differential Revision: https://phab.corp.pixielabs.ai/D6148
GitOrigin-RevId: 12e43f384aefb02cff62a9de62b285862c35e909
-
- 08 Sep, 2020 2 commits
-
-
Yaxiong Zhao authored
Summary: This makes it easy to distinguish old and new logs.
Test Plan: Manual run; works as expected.
Reviewers: oazizi, #engineering
Reviewed By: oazizi, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6176
GitOrigin-RevId: 2f4b6788513a90a62e0389a78d0df35d3d74b638
-
Michelle Nguyen authored
Summary: The previous name for our log index (vizier-logs-allclusters-2) was matching the index name pattern for our old logs (vizier-logs-allclusters-*), which was causing some weirdness with reindexing. As a result, we have a log index that is 600GB. After we create this new log index, which should hopefully get managed correctly, we'll need to delete vizier-logs-allclusters-2.
Test Plan: n/a
Reviewers: jamesbartlett, zasgar, #engineering
Reviewed By: jamesbartlett, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6179
GitOrigin-RevId: 8141b13b5c62adb9754208763de8f2e9b2a631af
-
- 04 Sep, 2020 2 commits
-
-
Michelle Nguyen authored
Summary: we're still running out of space on customer's etcd, except now metadata is reporting that the etcd data itself is only 600 mb... we need to figure out what else is taking up space. a higher snapshot count increases the amount of snapshots we take, but reduces the number of raft logs we need to hold in memory. Test Plan: n/a Reviewers: zasgar, #engineering Reviewed By: zasgar, #engineering Differential Revision: https://phab.corp.pixielabs.ai/D6170 GitOrigin-RevId: e29cd8642e6c8911e555e13aa6dd5c9ffe8df6ea
-
James Bartlett authored
Summary: TSIA Test Plan: Added test to repro bug. Reviewers: zasgar, #engineering Reviewed By: zasgar, #engineering JIRA Issues: PP-2190 Differential Revision: https://phab.corp.pixielabs.ai/D6162 GitOrigin-RevId: 3748dbc76b1baf1efa36f38b12d88dd10c6be9db
-
- 06 Sep, 2020 1 commit
-
-
Omid Azizi authored
Summary: To deploy libraries, including the case where the library is in a container. Test Plan: Added a test for a library inside a container Reviewers: #engineering, yzhao Reviewed By: #engineering, yzhao Subscribers: yzhao Differential Revision: https://phab.corp.pixielabs.ai/D6151 GitOrigin-RevId: 3ce58fe7138ebd5c6ae8a3bf89e9b07c9ba43a38
-
- 05 Sep, 2020 1 commit
-
-
Michelle Nguyen authored
Summary: We were seeing "TypeError: Cannot read property 'identify' of undefined". This is because the withLDProvider that sets up the client initializes asynchronously and can sometimes take longer to load than the Vizier page takes to render. To fix this we could use asyncWithLDProvider, which would block the rest of the page from rendering until the LDClient is loaded; the documentation says this may sometimes take up to 200ms. Instead, I just wrapped the LDClient code in an if statement. I took a look at the implementation of withLDClient, and it uses React context, so when the LDClient does load, this should cause the Vizier page to rerender and properly start up the LDClient.
Test Plan: n/a
Reviewers: zasgar, #engineering
Reviewed By: zasgar, #engineering
Differential Revision: https://phab.corp.pixielabs.ai/D6166
GitOrigin-RevId: 044c7890105f73520a68260ea1dd36c1d28eb8e8
-
- 04 Sep, 2020 1 commit
-
-
Michelle Nguyen authored
Summary: we recently updated the pxl makefile. there is no more staging bundle that needs to be updated after a staging deploy, and now the command is just "update_bundle" for prod Test Plan: ran jenkins job Reviewers: zasgar, #engineering Reviewed By: zasgar, #engineering Differential Revision: https://phab.corp.pixielabs.ai/D6168 GitOrigin-RevId: 8ee18ddca9f304f31000d88c14e7553e30a62757
-