15:00:22 #startmeeting neutron_northbound
15:00:22 Meeting started Mon Apr 24 15:00:22 2017 UTC. The chair is yamahata_. Information about MeetBot at http://ci.openstack.org/meetbot.html.
15:00:22 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:22 The meeting name has been set to 'neutron_northbound'
15:00:23 hi
15:00:32 mkolesni: hi
15:00:33 sorry I wasn't on last week, we had holidays
15:00:41 #topic agenda bashing and roll call
15:00:48 #info yamahata
15:00:49 #info mkolesni
15:00:58 #info rajivk
15:01:07 #link https://wiki.opendaylight.org/view/NeutronNorthbound:Meetings meeting agenda
15:01:26 do we have any other topic to discuss this week?
15:01:37 I think today we have several patches to discuss.
15:01:46 anything else?
15:02:01 nothing special from me
15:02:11 btw last week I was pulled away for another issue. This week I'm going to catch up.
15:02:25 oh i can mention a bit about scale testing
15:02:42 we had some scale testing done with newton code
15:02:51 very cool.
15:02:59 indeed
15:03:09 so there are a few interesting findings there
15:03:24 let me know when i can talk about it
15:03:45 Then let's discuss it before the patch/bug discussion.
15:04:00 with pike/nitrogen planning
15:04:04 ok
15:04:19 anything else?
15:05:06 I'm also catching up after a long break
15:05:28 okay, move on
15:05:30 #topic Announcements
15:05:42 The Carbon release is delayed due to the karaf4 issue.
15:05:53 Anyway, the release review will take place this week
15:06:20 Then we should have a release plan for Nitrogen
15:06:42 For Pike I have no update. just that the openstack summit is approaching.
15:06:55 any other announcements?
15:07:22 unfortunately im unable to come to the upcoming summit
15:07:42 #topic action items from last meeting
15:07:58 there were several patch reviews. but I'm going to catch up on them this week
15:08:05 #topic Pike/Nitrogen planning
15:08:20 Okay, so far I'm planning for Pike/Nitrogen
15:08:26 in my mind
15:08:34 dhcp port creation issue.
15:08:50 port status update. I think Josh and Ritu would drive it.
15:08:57 openstack rpc from odl.
15:09:10 those would be the major issues.
15:09:21 mkolesni: you are on the stage.
15:09:29 thanks
15:09:55 so we had some scale testing done at red hat with the newton code base
15:10:16 im not sure if the results will be published or not, if they will i'll share the link
15:10:20 I'm very curious about it.
15:10:35 v2 driver I presume?
15:10:41 so first there was a problem where the entire cloud grinds to a halt after a while
15:10:50 yes v2 driver
15:11:15 so it was identified that there were too many journal threads
15:11:37 the test machine had 56 cores available
15:11:48 For example, do Sam or Andre know about the scale test? If yes, we can discuss it at the ODL DDF privately, so that you don't have to publish the results.
15:11:59 by default, in newton, neutron will allocate all cores to api & rpc workers
15:12:08 Only with a single neutron server?
15:12:23 no there were 3 controllers
15:12:29 but each had 56 cores
15:12:46 Cool.
15:12:47 so there were 112 neutron processes on each controller node
15:12:59 each had at least 2 journal threads
15:13:40 so in total it seems that the abundance of threads exhausts all the db server side connections
15:13:49 mkolesni, was it due to the limit on connections to the database?
15:14:01 yes but on the db side
15:14:03 how many connections did you set for the database?
15:14:22 im not sure if it was changed or not
15:14:30 yeah, i worked on openstack HA a year ago, we faced this issue.
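To put rough numbers on the connection exhaustion described above, here is a back-of-the-envelope sketch in Python. The worker counts follow the figures from the discussion (3 controllers, 56 cores each, api and rpc workers defaulting to the core count); the SQLAlchemy pool sizes are assumed defaults, not values taken from the test environment.

    # Illustrative arithmetic only -- pool sizes are assumptions.
    controllers = 3
    cores = 56                        # per controller
    processes = cores * 2             # api workers + rpc workers (Newton defaults)
    pool_size, max_overflow = 5, 10   # assumed SQLAlchemy QueuePool defaults

    typical = controllers * processes * pool_size                      # 1680
    worst_case = controllers * processes * (pool_size + max_overflow)  # 5040
    print(typical, worst_case)
    # Both figures are far above a stock mysql max_connections setting,
    # matching the "too many connections" symptom described in the meeting.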
15:14:35 but the same env was run with the ml2+ovs combo and had no problems
15:14:38 the # of allowed connections in mysql needs to be adjusted for huge deployments.
15:15:02 yeah, we adjusted it, we faced the same issue with ml2+ovs
15:15:13 so api & rpc workers were set to 8 which seemed to work
15:15:37 but the default max for ocata+ is 12
15:15:49 and that didnt work well (again too many connections)
15:16:01 so i applied the journal singleton patch
15:16:16 that made things a lot better, with 12 threads the tests passed
15:16:31 Now I see your motivation for the patch.
15:16:37 however it seemed that the journal was backing up with requests
15:16:57 so i tried to analyze the logs but theyre quite lacking in detail
15:17:23 seems that fetching a journal row takes increasingly more time as the table fills with thousands of rows
15:17:38 so maybe an index can help
15:18:07 I've been thinking that the current logic tries to pick up rows based on updated time, not the oldest one.
15:18:12 anyway i wanted to also apply the dependency validation mode patch, but didnt have enough time to finish it before the testing window closed
15:18:26 So far you've been sticking to such a picking order.
15:18:35 yes i saw your bug report
15:18:47 well the journal is built to work around that
15:18:57 error handling and all
15:18:58 Later I'd like to explain the dependency validation issue.
15:19:14 Do you think adding a key value store layer will reduce the problem to some extent, where the read queries can be directed to it?
15:19:21 anyway i didnt have time to see if the dependency validation improves the situation or not
15:19:55 or maybe we can keep some data structure to store dependencies etc. to reduce load on the database.
15:20:08 im not sure, we need to further analyze the logs and perhaps run a profiler to identify the hot spots
15:20:48 just adding a cache is like shooting in the dark, not sure it will help but it can be very painful :)
15:21:08 we need to identify the hot spots via profiling or some log analysis
15:21:11 anyway
15:21:19 I worked on dragonflow for some time.
15:21:30 another issue was identified with odl itself where it leaked memory
15:21:37 so we had to set the heap size to 22GB
15:21:51 do you mean the ODL jvm process?
15:21:55 Hmm that's bad.
15:21:58 yes
15:22:30 any analysis done on where the leak was?
15:22:50 not that i know of, i focused on the driver part
15:23:22 ok
15:23:40 so in conclusion the tests revealed that too many threads arent better, as i suspected
15:23:50 so we can discuss that further if youd like
15:24:35 regarding networking-odl, there are at least two scalability issues.
15:24:42 the number of journal threads.
15:25:25 Maybe one thread per neutron process? or another way to tune it, e.g. only several threads in the parent server and no threads in the worker processes.
15:25:43 i think maybe in rpc workers we dont need threads?
15:25:59 if we dont run with the dhcp agent at least
15:26:04 another issue is that the journaling thread would lag.
15:26:12 since those threads will only be triggered by a timer
15:26:43 We can have api/rpc worker processes. But we don't have to run journal threads in the worker processes.
15:27:11 yes thats what i meant, only have a journal thread per api worker
15:27:14 From your description, there are too many journal threads, right?
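As a rough illustration of the shape being discussed here (one journal thread per process, woken either by API activity or by a periodic safety-net timer), a minimal Python sketch follows. The names, the 5-second interval, and the structure are assumptions drawn from the conversation, not the actual networking-odl implementation.

    import threading

    SYNC_INTERVAL = 5          # seconds; the periodic safety-net check
    _wake = threading.Event()
    _started = False
    _start_lock = threading.Lock()

    def trigger_sync():
        """Called from API request processing once a journal row is added."""
        _wake.set()

    def _journal_loop(process_pending_rows):
        while True:
            # Wake early when an API request triggers us, otherwise fall back
            # to the timer, which only acts as a safety net for missed events.
            _wake.wait(timeout=SYNC_INTERVAL)
            _wake.clear()
            process_pending_rows()

    def start_journal_thread(process_pending_rows):
        """Start at most one journal thread in this process (singleton)."""
        global _started
        with _start_lock:
            if _started:
                return
            _started = True
        threading.Thread(target=_journal_loop, args=(process_pending_rows,),
                         daemon=True).start()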
15:27:21 yes
15:27:32 so the singleton patch actually mitigates it
15:27:40 since it forces 1 thread per process
15:27:53 but it doesn't discern if its an api or rpc worker process
15:27:57 We can have a more flexible way than 1 thread per process.
15:28:18 actually im not sure the number of threads is really the problem
15:28:41 1 thread per process is one of the possibilities.
15:28:48 i think the main problem is that if you have any X amount of threads, they will wake up every 5 seconds to check the db
15:29:12 I agree that that's a bad situation.
15:29:22 now this check is only a safety net for journal entries in case of a network connectivity issue
15:29:49 so maybe we can limit the amount of threads running on this timer
15:30:07 maybe thats the golden ticket
15:30:25 since generally a thread can be triggered either by api request processing
15:30:28 or by timers
15:30:41 so by api request is fine, we need to process them obviously
15:30:59 but by timer may be too much hammering for the poor db
15:31:23 anyway i think the singleton patch is a good start for now and we can incorporate any other idea with it
15:31:41 waking up threads would cause a thundering herd. Probably it can be easily improved somehow.
15:31:42 now regarding the journal backlog increasing, i wasnt able to identify why it happened
15:32:04 By introducing the threadpool patch, it's easy to tune to 1 thread per process. It's upward compatible.
15:32:12 i noticed it doesnt seem to be ODL itself
15:32:36 Regarding the journal backlog, one possible way is to throttle API processing.
15:32:39 i.e. sending to odl didnt cause it to get stuck
15:33:03 i think the dependency move should improve that
15:33:20 since it introduces a dependency calculation during journal entry creation
15:33:31 so it actually would throttle the api
15:33:48 and relieve the journal thread processing from calculating each time
15:34:03 so if you think of it as a scale it should tip the scale a bit
15:34:06 I mean to throttle api processing with postcommit.
15:34:27 can you elaborate on that?
15:34:57 we can try to send the request to ODL in postcommit.
15:34:58 414989
15:35:21 well that would defeat the journaling idea imho
15:36:23 if we do that were going back to the v1 driver which did it in post commit
15:36:27 I'm not sure why it defeats it.
15:36:35 No, I would say.
15:36:44 And also we can change the picking order.
15:37:12 anyway right now those are ideas.
15:37:23 Do you have any other ideas?
15:37:30 yes i think more analysis is needed to identify the actual problem
15:37:40 then we can suggest ideas
15:38:14 mkolesni: do you have anything further to discuss? or shall we move on to other patches?
15:38:46 nothing further regarding this
15:38:58 let's move on to patch reviews
15:39:03 #topic patches/bugs
15:39:16 First, let me explain https://review.openstack.org/#/c/453581/
15:39:27 the dependency calculation patch
15:39:30 sure
15:39:34 It introduces a race condition.
15:39:51 suppose there are two threads updating the same neutron resource.
15:40:00 thread 1: start db transaction
15:40:05 thread 2: start db transaction
15:40:10 thread 1: calculate dependency
15:40:11 oh btw did u see my reply?
15:40:17 thread 2: calculate dependency
15:40:23 thread 1: insert row and commit
15:40:30 thread 2: insert row and commit
15:40:49 https://review.openstack.org/#/c/453581/9/networking_odl/journal/journal.py
15:40:58 In this case, one of thread 1 and thread 2 calculates the dependency wrongly.
15:41:07 It misses the other thread.
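The interleaving listed above can be modelled with a small self-contained Python toy (not networking-odl code): two threads each calculate their dependencies while their own transaction is still open, so neither can see the other's not-yet-committed row.

    import threading

    committed_rows = []             # stands in for the journal table
    lock = threading.Lock()
    barrier = threading.Barrier(2)  # force both to calculate before either commits

    def create_entry(name):
        with lock:
            visible = list(committed_rows)              # only committed rows are visible
        deps = [row for row in visible if row != name]  # calculate dependency
        barrier.wait()                                  # the other thread has not committed yet
        with lock:
            committed_rows.append(name)                 # insert row and commit
        print(name, "depends on", deps)

    for name in ("X", "Y"):
        threading.Thread(target=create_entry, args=(name,)).start()
    # Both entries print an empty dependency list: each one missed the other.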
15:41:19 If the dependency is calculated with the current code, such a thing doesn't happen.
15:41:32 did you read my reply there?
15:42:01 Yes, and I think your reply isn't right.
15:42:08 ok please explain
15:42:37 Now I showed the scenario where your patch doesn't work correctly.
15:42:46 it doesnt matter
15:42:55 the race exists regardless of the patch
15:43:00 thats what i explained
15:43:16 regarding todays code:
15:43:19 Let's assume 2 different updates to the same resource, such as a port, from 2 different API calls:
15:43:19 1st update gets handled by neutron DB
15:43:19 2nd update gets handled by neutron DB <- this update won
15:43:19 2nd update gets a journal entry X created
15:43:20 1st update gets a journal entry Y created
15:43:29 Now when the journal thread processes the entries, the Y entry will get sent last as it was created later in the chain of events so it got a bigger seqnum.
15:43:41 so this race can happen even now
15:43:53 the mechanics of how it happens dont matter
15:44:06 what matters is if it can happen or not
15:44:13 It doesn't matter whether X or Y wins.
15:44:38 of course it matters, the later update should win
15:44:46 If the dependency is calculated while creating row X or Y, the dependency calculation doesn't see the other row.
15:45:02 So the dependency is wrong.
15:45:06 the example i gave is with the existing code base
15:45:12 without the change i introduce
15:45:52 in the existing code base this race can happen as well
15:46:07 If we calculate the dependency after committing X and Y, as the existing code does, the dependency calculation sees the other row.
15:46:10 the older update could be what gets sent to odl
15:46:45 yes but in the existing code they still race for the order of inserting the journal entry, which will affect the dependency calculation
15:46:55 so the race is still there
15:47:10 regardless of where or when you do the dependency calculation
15:48:09 even the existing code has this race
15:48:10 After committing the journal entry, the dependency calculation sees all the entries before it.
15:48:30 youre talking about the existing code?
15:48:39 But during the db transaction committing the journal entry, it doesn't see the other competing transaction.
15:48:41 That's the issue.
15:49:19 I'm talking about the difference between the existing code and your patch.
15:49:31 and trying to show what's wrong.
15:49:40 do you agree the existing code is raceful?
15:49:46 No.
15:49:55 so you say my example cant happen?
15:50:00 clearly it can
15:50:11 think about it a bit
15:50:40 again, regarding the existing code base without any change:
15:50:41 Let's assume 2 different updates to the same resource, such as a port, from 2 different API calls:
15:50:42 1st update gets handled by neutron DB
15:50:42 2nd update gets handled by neutron DB <- this update won
15:50:42 2nd update gets a journal entry X created
15:50:42 I agree that either of the competing db transactions can win.
15:50:42 1st update gets a journal entry Y created
15:50:43 Now when the journal thread processes the entries, the Y entry will get sent last as it was created later in the chain of events so it got a bigger seqnum.
15:50:44 That's okay.
15:50:46 So, we have a discrepancy between Neutron DB and ODL.
15:50:57 you say this example is wrong?
15:51:21 I agree that either X or Y can win.
15:51:25 That's okay.
15:51:34 ok so is it raceful then or not?
15:51:48 During those db transactions to create X or Y, each transaction doesn't see the other transaction.
15:52:01 The thread to create entry X doesn't see Y.
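For contrast with the earlier toy, the argument above ("after committing the journal entry, the dependency calculation sees all the entries before it") can be modelled the same way: when each thread calculates its dependencies only after its own row is committed, the later of the two calculations always sees the other row. Again this is only a toy model, not the actual networking-odl logic.

    import threading

    committed_rows = []
    lock = threading.Lock()

    def create_then_calculate(name):
        with lock:
            committed_rows.append(name)     # insert row and commit first
        with lock:                          # calculate dependency afterwards
            deps = [row for row in committed_rows if row != name]
        print(name, "depends on", deps)

    for name in ("X", "Y"):
        threading.Thread(target=create_then_calculate, args=(name,)).start()
    # Whichever entry commits second always sees the first one, so the two
    # entries are never treated as completely independent.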
15:52:05 please focus on the existing situation
15:52:10 the thread to create entry Y doesn't see X.
15:52:11 without any change
15:52:19 is it raceful or not?
15:52:44 So the dependency calculated during the db transaction doesn't include X or Y.
15:53:02 isaku please focus on the existing situation
15:53:09 not any change that could happen
15:53:18 is the existing situation raceful?
15:53:23 No.
15:53:35 how no? youre contradicting yourself
15:53:52 I agree that either X or Y can win.
15:53:52 That's okay.
15:53:54 If you mean either X or Y can win, I agree.
15:54:02 so its raceful
15:54:07 If you call that raceful, yes it's raceful.
15:54:08 thats the definition of raceful
15:54:21 if either one of the updates can win its raceful
15:54:28 correct?
15:54:42 Okay, it's correct.
15:54:53 the outcome can be either correct or not, depending on who wins the race
15:55:01 ok and thats the existing code
15:55:04 now my change
15:55:13 it doesnt make this race better
15:55:23 but it doesnt make it worse
15:55:26 its still there
15:55:32 thats all im saying
15:55:37 Your patch makes the dependency calculation wrong.
15:55:40 the race is not related to the change
15:56:07 how can it be wrong if its according to the spec we agreed upon?
15:56:18 im sorry youre not making much sense
15:56:19 Ah, in that sense, the race itself isn't the cause.
15:56:38 Let's assume X wins over Y.
15:56:40 if anything, the patch is excellent because it brought this race to our attention
15:56:45 X has a smaller seqnum than Y.
15:57:01 With your patch, the dependency calculation of Y doesn't include X.
15:57:14 With the existing code, the dependency calculation of Y includes X.
15:57:22 look the mechanism that causes the race to happen doesnt matter
15:57:44 in the end if you look at it as a black box the race is there before the change and it's there after
15:58:11 now we can debate about whether this race should be fixed or not, but you can't claim this is incorrect
15:58:37 so given that this race exists even now this change is not about fixing it
15:59:15 Ah, I see. I'm not claiming to remove the race.
15:59:26 i agree theoretically this is a bit more dangerous, but the only real scenario that you & i can think of that would actually have a race is updating the same resource
15:59:35 I'm asking to make the dependency calculation correct with your patch.
15:59:39 and this race already exists
15:59:51 im not sure then what you expect
16:00:04 if you have an idea please propose it
16:00:05 Again, with your patch, the dependency calculation can be wrong.
16:00:16 One possible way is to use a table lock.
16:00:36 to solve a single race that already exists today?
16:00:43 or somehow the dependency calculation and row insert need to be done atomically.
16:00:50 why not do this fix separately?
16:01:12 No. Because your patch newly introduces the issue.
16:01:36 no we already agreed this race is there without my patch
16:01:51 ok sorry gtg but we can talk about this another time
16:02:16 Based on your definition, I agree that the race exists. I'm claiming your patch has the dependency calculation issue.
16:02:28 let's continue the discussion.
16:02:34 any other patches to discuss?
16:02:39 bye guys
16:03:32 yamahata, what is the issue with the CI?
16:03:50 rajivk: sorry, I haven't had time to look into your patch.
16:03:53 I checked; odl could not start
16:03:58 I'm going to do it this week.
16:04:04 thanks,
16:04:12 I'm aware that the CI is broken. again.
16:04:23 Did you check the cause?
16:04:29 not yet.
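To illustrate the "table lock / do the dependency calculation and row insert atomically" idea proposed above, the toy model can be extended with a single lock around both steps. This is only a sketch of the general approach under discussion; a real fix would need database-level locking and careful handling of transaction isolation, which this toy ignores.

    import threading

    committed_rows = []
    journal_lock = threading.Lock()   # stands in for a table or per-object lock

    def create_entry_atomic(name):
        with journal_lock:                # dependency calc + insert as one unit
            deps = list(committed_rows)   # calculate dependency
            committed_rows.append(name)   # insert row and commit
        print(name, "depends on", deps)

    for name in ("X", "Y"):
        threading.Thread(target=create_entry_atomic, args=(name,)).start()
    # The second entry always records the first as a dependency, which is the
    # behaviour the dependency mechanism expects.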
16:04:52 okay, just in case you don't get much time.
16:05:13 Ping me the result of your investigation or a direction to investigate
16:05:18 sure.
16:05:25 anything else?
16:05:26 I will start looking into it in the morning.
16:05:34 no
16:05:58 #topic open mike
16:06:07 anything else? or can we close the meeting?
16:06:18 close the meeting.
16:06:30 thank you everyone
16:06:36 #topic cookies
16:06:41 #endmeeting